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COMMUNICATION  AVOIDING  RANK  REVEALING  QR 
FACTORIZATION  WITH  COLUMN  PIVOTING 

JAMES  W.  DEMMEL* * * §,  LAURA  GRIGORlt,  MING  GU  *,  AND  HUA  XIANG  § 

Abstract.  In  this  paper  we  introduce  CARRQR,  a  communication  avoiding  rank  revealing  QR 
factorization  with  tournament  pivoting.  We  show  that  CARRQR  reveals  the  numerical  rank  of  a 
matrix  in  an  analogous  way  to  QR  factorization  with  column  pivoting  (QRCP).  Although  the  upper 
bound  of  a  quantity  involved  in  the  characterization  of  a  rank  revealing  factorization  is  worse  for 
CARRQR  than  for  QRCP,  our  numerical  experiments  on  a  set  of  challenging  matrices  show  that  this 
upper  bound  is  very  pessimistic,  and  CARRQR  is  an  effective  tool  in  revealing  the  rank  in  practical 
problems. 

Our  main  motivation  for  introducing  CARRQR  is  that  it  minimizes  data  transfer,  modulo  poly- 
logarithmic  factors,  on  both  sequential  and  parallel  machines,  while  previous  factorizations  as  QRCP 
are  communication  sub-optimal  and  require  asymptotically  more  communication  than  CARRQR. 
Hence  CARRQR  is  expected  to  have  a  better  performance  on  current  and  future  computers,  where 
commmunication  is  a  major  bottleneck  that  highly  impacts  the  performance  of  an  algorithm. 

Key  words.  QR  factorization,  rank  revealing,  column  pivoting,  minimize  communication 

AMS  subject  classifications.  65F25,  65F20 


1.  Introduction.  Revealing  the  rank  of  a  matrix  is  an  operation  that  appears 
in  many  important  problems  as  least  squares  problems,  low  rank  approximations, 
regularization,  nonsymmetric  eigenproblems  (see  for  example  [8]  and  the  references 
therein).  In  this  paper  we  focus  on  the  rank  revealing  QR  factorization  [8,  7,  16], 
which  computes  a  decomposition  of  a  matrix  A  £  Rmxra  0f  the  form 


An  =  QR  =  Q 


Rll  R12 
R-22 


(1.1) 


where  Q  £  Rmxm  is  ortligonal,  i?n  £  Rkxk  is  upper  triangular,  R12  £  Rkx(n~k\ 
and  R22  €  R(m-fc)x(n-fe).  The  column  permutation  matrix  II  and  the  integer  k  are 
chosen  such  that  H-R22II9  is  small  and  i?n  is  well-conditioned.  This  factorization  was 
introduced  in  [16],  and  the  first  algorithm  to  compute  it  was  proposed  in  [6]  and 
is  based  on  the  QR  factorization  with  column  pivoting  (QRCP).  A  BLAS-3  version 
of  this  algorithm  [25]  is  implemented  in  LAPACK  [1],  and  its  parallel  version  in 
ScaLAPACK  [5]. 
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The  performance  of  an  algorithm  is  highly  impacted  by  the  amount  of  commu¬ 
nication  performed  during  its  execution,  where  communication  refers  to  both  data 
transferred  between  different  levels  of  the  memory  hierarchy  of  a  processor  or  be¬ 
tween  different  processors  of  a  parallel  computer.  Research  performed  in  the  recent 
years  has  shown  that  most  of  the  classic  algorithms  in  direct  dense  linear  algebra 
transfer  more  data  than  lower  bounds  on  communication  indicate  is  necessary,  and 
new,  communication  optimal  algorithms  can  and  should  be  developed.  In  this  con¬ 
text,  the  goal  of  this  paper  is  to  design  a  pivoted  QR  factorization  that  is  effective  in 
revealing  the  rank  of  a  matrix  but  also  minimizes  communication,  on  both  sequential 
and  parallel  machines. 

There  are  several  different  definitions  for  determining  when  the  factorization  from 
equation  (1.1)  reveals  the  rank  (see,  for  example,  [15]),  one  of  them  [22,  9]  says  that 
the  factorization  from  equation  (1.1)  is  a  rank  revealing  QR  factorization  (RRQR)  if 


^mm(^  11 )  'll 


Pfc(A) 

p{k,  n) 


and  (Jmax{R22)  <  crk+1(A)p(k,n), 


(1.2) 


where  crmin{A)  and  o-max(A)  are  the  smallest  and  the  largest  singular  values  of  A 
respectively,  and  p{k,n)  is  a  low  degree  polynomial  in  n  and  k.  Since  the  ( k  +  l)-th 
largest  singular  value  ak+ i  <  crmax(-R22)  =  H-R22II2  and  HR22II2  is  small,  then  A  can 
be  considered  to  have  numerical  rank  k.  The  first  k  columns  Q(:,  1  :  k)  form  an 


approximate  orthogonal  basis  for  the  range  of  A  and  II 


R\\  R\2 
-J 


are  approximate 


null  vectors. 

Given  a  rank  k  and  a  parameter  /  >  1,  it  is  shown  in  [20]  that  there  exists  a 
permutation  II  such  that  the  factorization  displayed  in  equation  (1.1)  satisfies  the 
inequality 


(R^R12)lj+ojf  (i?n)  7j  (R22)  <  f,  (1.3) 

where  Rtj  is  the  element  in  position  (i,  j)  of  R.  uJt(R-[  1 )  denotes  the  2-norm  of  the 
i-th  row  of  R11,  and  77(^22)  denotes  the  2-norm  of  the  j-th  column  of  i?22-  This 
factorization  is  called  a  strong  RRQR  factorization,  and  is  more  powerful  than  the 
classic  QR  factorization  with  column  pivoting  which  only  guarantees  that  /  =  0(2"). 
A  strong  RRQR  factorization  is  computed  by  performing  first  a  QR  factorization  with 
column  pivoting  followed  by  additional  swaps  of  columns.  In  Section  2,  we  discuss  in 
more  detail  the  characterization  of  a  strong  rank  revealing  QR  factorization,  as  well 
as  more  relaxed  versions  of  the  bounds  from  equation  (1.3). 

In  practice  the  QR  factorization  with  column  pivoting  often  works  well,  and  it 
is  widely  used  even  if  it  is  known  to  fail,  for  example  on  the  so  called  Kalian  matrix 
that  we  describe  in  more  detail  in  section  4.  However  in  terms  of  communication,  the 
QR  factorization  with  column  pivoting  is  sub-optimal  with  respect  to  lower  bounds 
on  communication  identified  in  [3]  (under  certain  assumptions  in  the  case  of  the  QR 
factorization).  If  the  algorithm  is  performed  in  parallel,  then  typically  the  matrix 
is  distributed  over  P  processors  by  using  a  two-dimensional  block  cyclic  partition- 
ning.  This  is  indeed  the  approach  used  in  the  psgeqpf  routine  from  ScaLAPACK.  At 
each  step  of  the  decomposition,  the  QR  factorization  with  column  pivoting  finds  the 
column  of  maximum  norm  and  permutes  it  to  the  leading  position,  and  this  requires 
exchanging  0(n)  messages,  where  n  is  the  number  of  columns  of  the  input  matrix.  For 
square  matrices,  when  the  memory  per  processor  used  is  on  the  order  of  0(n2  /  P),  the 
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lower  bound  on  the  number  of  messages  to  be  exchanged  is  f 1(V~P).  The  number  of 
messages  exchanged  during  the  QR  factorization  with  column  pivoting  is  larger  by  at 
least  a  factor  oin/y/P  than  the  lower  bound.  When  QRCP  is  executed  on  a  sequential 
machine  with  a  fast  memory  of  size  M  and  a  slow  memory,  then  the  volume  of  data 
transferred  between  fast  and  slow  memory  is  on  the  order  of  0(n3),  and  the  number 
of  messages  is  at  least  0(n3/M).  The  lower  bound  on  the  volume  of  communication 
and  the  number  of  messages  is  0(n3/Af1/2),  fi(n3/M3/2)  respectively.  We  note  that 
the  classic  QR  factorization  with  no  pivoting  in  which  each  column  is  annihilated  by 
using  one  Householder  transformation  is  also  sub-optimal  in  terms  of  communication. 
A  communication  optimal  algorithm  (modulo  polylogarithmic  factors),  referred  to  as 
communication  avoiding  QR  (CAQR),  has  been  introduced  in  [10,  11]. 

In  this  paper  we  introduce  CARRQR,  a  communication  optimal  (modulo  polylog¬ 
arithmic  factors)  RRQR  factorization  based  on  tournament  pivoting.  The  factoriza¬ 
tion  is  based  on  an  algorithm  that  computes  the  decomposition  by  blocks  of  b  columns 
(panels).  For  each  panel,  tournament  pivoting  proceeds  in  two  steps.  The  first  step 
aims  at  identifying  a  set  of  b  candidate  pivot  columns  that  are  as  well-conditioned 
as  possible.  These  columns  are  permuted  to  the  leading  positions,  and  they  are  used 
as  pivots  for  the  next  b  steps  of  the  QR  factorization.  To  identify  the  set  of  b  can¬ 
didate  pivot  columns,  a  tournament  is  performed  based  on  a  reduction  operation, 
where  at  each  node  of  the  reduction  tree  b  candidate  columns  are  selected  by  using 
the  strong  rank  revealing  QR  factorization.  The  idea  of  tournament  pivoting  has  been 
first  used  to  reduce  communication  in  Gaussian  elimination  [18,  19],  and  then  in  the 
context  of  a  newly  introduced  LU  factorization  with  panel  rank  revealing  pivoting  [24] . 
CARRQR  is  optimal  in  terms  of  communication,  modulo  polylogarithmic  factors,  on 
both  sequential  machines  with  two  levels  of  slow  and  fast  memory  and  parallel  ma¬ 
chines  with  one  level  of  parallelism,  while  performing  three  times  more  floating  point 
operations  than  QRCP.  We  expect  that  on  computers  where  communication  is  the 
bottleneck,  CARRQR  will  be  faster  than  other  algorithms  as  QRCP  which  do  not 
minimize  communication.  We  believe  that  large  speedups  can  be  obtained  on  future 
computers  (if  not  the  present  for  sufficiently  large  matrices)  where  communication 
plays  an  increasingly  important  role  for  the  performance  and  parallel  scalability  of  an 
algorithm. 

We  show  that  CARRQR  computes  a  permutation  that  satisfies 

(R^R^)  l-  +  (7 j  (R22)  /w  (Rn))2  <  F2,  (1.4) 

where  F  is  a  constant  dependent  on  k ,  /,  and  n.  Equation  (1.4)  looks  very  similar 
to  equation  (1.3),  and  reveals  the  matrix  numerical  rank  in  a  completely  analogous 
way  (see  Theorems  2.2  and  2.4).  While  our  upper  bound  on  F  is  super-exponential 
in  n  (see  Theorem  2.10),  our  extensive  experiments,  including  those  on  challenging 
matrices,  show  that  this  upper  bound  is  very  pessimistic  in  general  (see  Section  4). 
These  experiments  demonstrate  that  CARRQR  is  as  effective  as  QR  with  column 
pivoting  in  revealing  the  rank  of  a  matrix.  For  the  cases  where  QR  with  column 
pivoting  does  not  fail,  CARRQR  also  works  well,  and  the  values  on  the  diagonal  of 
the  R  factor  are  very  close  to  the  singular  values  of  the  input  matrix  computed  with 
the  highly  accurate  routine  DGESVJ  [13,  14].  The  matrices  in  our  set  were  also  used 
in  previous  papers  discussing  rank  revealing  factorizations  [4,  20,  26,  23]. 

The  rest  of  this  paper  is  organized  as  follows.  Section  2  presents  the  algebra  of 
CARRQ  and  shows  that  it  is  a  rank  revealing  factorization  that  satisfies  equation 
(1.4).  Section  3  analyzes  the  parallel  and  sequential  performance  of  CARRQR  and 
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discusses  its  communication  optimality.  Section  4  discusses  the  numerical  accuracy 
of  CARRQR  and  compares  it  with  QRCP  and  the  singular  value  decomposition. 
Section  5  outlines  how  tournament  pivoting  can  be  extended  to  other  factorizations  as 
Cholesky  with  diagonal  pivoting,  LU  with  complete  pivoting,  or  LDLT  factorization 
with  pivoting.  Finally,  section  6  concludes  our  paper. 

2.  Rank  revealing  QR  factorization  with  tournament  pivoting.  This  sec¬ 
tion  presents  the  algebra  of  the  QR  factorization  algorithm  based  on  a  novel  pivoting 
scheme  referred  to  as  tournament  pivoting  and  analyzes  its  numerical  stability.  We 
refer  to  this  algorithm  as  CARRQR. 

2.1.  The  algebra.  We  consider  a  block  algorithm  that  partitions  the  matrix  A 
of  size  mxn  into  panels  of  size  b.  In  classic  QR  factorization  with  column  pivoting,  at 
each  step  i  of  the  factorization,  the  remaining  unselected  column  of  maximum  norm 
is  selected  and  exchanged  with  the  i- th  column,  its  subdiagonal  elements  are  annihi¬ 
lated,  using  for  example  a  Householder  transformation,  and  then  the  trailing  matrix  is 
updated.  A  block  version  of  this  algorithm  is  described  in  [25].  The  main  difficulty  in 
reducing  communication  in  rank  revealing  QR  factorization  lies  in  identifying  b  pivot 
columns  at  each  step  of  the  block  algorithm.  Our  communication  avoiding  algorithm, 
CARRQR,  is  based  on  tournament  pivoting,  and  uses  a  reduction  operation  on  blocks 
of  columns  to  identify  the  next  b  pivot  columns  at  each  step  of  the  block  algorithm. 
This  idea  is  analogous  to  the  reduction  operation  used  in  CALU  [18]  to  identify  the 
next  b  pivot  rows.  The  operator  used  at  each  node  of  the  reduction  tree  is  a  RRQR 
factorization.  Our  theoretical  analysis  presented  in  section  2.3  is  general  enough  to 
account  for  any  kind  of  RRQR  factorization,  from  classical  column  pivoting  to  the 
“strong”  RRQR  factorization  in  [20],  and  gives  tighter  bound  if  a  strong  RRQR  fac¬ 
torization  is  used.  The  numerical  experiments  from  section  4  will  show  that  using 
CARRQR  is  adequate  in  practice,  and  indeed  much  better  than  the  bounds  derived 
in  this  section.  This  is  somewhat  similar  to  using  traditional  column  pivoting  for  the 
overall  factorization:  the  average  case  is  much  better  than  the  worst  case,  although 
the  worst  case  can  occur  with  traditional  column  pivoting. 

To  illustrate  tournament  pivoting,  we  consider  an  m-by-n  matrix  A.  We  use  a 
binary  reduction  tree  and  operate  on  blocks  of  bx  columns,  so  there  are  n/bx  such 
blocks.  In  our  example  bx  =  n/ 4,  so  A  is  partitioned  as  A  =  [A00,  Aw,  A2o,  A30]- 
Hence  our  communication  avoiding  algorithm  has  two  parameters,  b  which  determines 
the  panel  size  and  bx  which  determines  the  height  of  the  reduction  tree.  These  two 
parameters  will  be  discussed  in  more  detail  in  section  3. 

Tournament  pivoting  starts  by  computing  a  (strong)  RRQR  factorization  of  each 
column  block  A.i o  to  identify  b  column  candidates, 


AioIIio  —  Qio 


* 

* 


for  i  =  0  to  3, 


where  H;o  are  permutation  matrices  of  size  bx  x  bx,  Qio  are  orthogonal  matrices  of 
size  m  x  m,  and  Rio  are  upper  triangular  matrices  of  size  in  x  b.  The  first  subscript  i 
indicates  the  column  block  of  the  matrix,  while  the  second  subscript  0  indicates  that 
the  operation  is  performed  at  the  leaves  of  the  binary  reduction  tree. 

At  this  stage  we  have  n/bx  sets  of  b  column  candidates.  The  final  b  columns  are 
selected  by  performing  a  binary  tree  (of  depth  log2  (n/bx)  =  2)  of  (strong)  RRQR 
factorizations  of  matrices  of  size  m  x  2b.  At  the  first  level  of  the  binary  tree,  we  form 
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two  matrices  by  putting  together  consecutive  sets  of  column  candidates. 


An  —  [(AoXlooXb  1  :  6),  (Ai0IIio)(:,  1  :  b)] 
-A  11  =  [(AoAoX:,  1  :  b),  (AoXXoX:,  1  :  b)] 


From  each  matrix  we  select  a  new  set  of  b  column  candidates  by  again  computing  a 
(strong)  RRQR  factorization: 


AiAi  —  Qn 


R, 


i  1 


for  i  =  0  to  1, 


where  IIji  are  permutation  matrices  of  size  2 6  x  26,  Qa  are  orthogonal  matrices  of 
size  to  x  to,  and  R%\  are  upper  triangular  matrices  of  size  6x6. 

At  the  second  (and  last)  level  of  the  binary  tree,  the  two  sets  of  6  column  candi¬ 
dates  from  the  first  level  are  combined  into  a  matrix  A2, 


Aq2  —  [(AiIIoi)(:,  1  :  6),  (AnIIii)(:,  1  :  6)]. 


The  final  6  columns  are  obtained  by  performing  one  last  (strong)  RRQR  factorization 

Of  Aq2- 


A02II02  —  Q  02 


R02 


where  II02  is  a  permutation  matrix  of  size  26  x  26,  Q 02  is  an  orthogonal  matrix  of 
size  to  x  m,  and  R02  is  an  upper  triangular  matrix  of  size  6x6:  The  final  6  columns 
selected  are  A^IIcA:,  1  :  6). 

The  matrices  II y  ,  i  =  0,1  and  j  =  1,2,  are  partitioned  into  four  blocks  of  size 
6  x  6  as 


fn(1) 

n^2)l 

II 

a 

1 

J3J 

1 

Let  II ij,  i  =  0, 1  and  j  =  1,  2,  be  permutation  matrices  obtained  by  extending  11,^ 
with  identity  matrices, 


H  ij  — 


(i) 

ij 

(3) 


(2) 

ij 

(4) 


where  r  =  n/P  —  b  for  n0i,  JT| j,  and  r  =  n/ 2  —  6  for  n02.  The  tournament  pivoting 
process  can  be  expressed  as 


A 


n. 


00 


n 


10 


n. 


20 


n 


30 


IX 


01 


n, 


XI02  —  Q  02 


R02  * 

* 


In  other  words,  the  factorization  performed  at  the  root  of  the  binary  tree  corresponds 
to  the  factorization  of  the  first  panel  of  the  permuted  matrix.  The  algorithm  updates 
the  trailing  matrix  using  Q 02  and  then  goes  to  the  next  iteration. 
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Different  reduction  trees  can  be  used  to  perform  tournament  pivoting.  The  binary 
tree  is  presented  in  the  following  picture  using  an  arrow  notation.  At  each  node  of  the 
reduction  tree,  /(A^)  returns  the  first  b  columns  obtained  after  performing  (strong) 
RRQR  of  Ajj.  The  input  matrix  Arj  is  obtained  by  adjoining  the  input  matrices  (on 
the  other  ends  of  the  arrows  pointing  towards  /(A,y)). 

Aoo  Aio  A20  A30 

■'l'  ■'I*  ■'l'  ■l" 

f{A 00)  /(A10)  /(A2o)  /(A30) 

\  x7  \  x7 

/(An)  /(An) 

\  x7 
f(A  02) 

A  flat  tree  is  presented  using  this  arrow  notation  in  the  following  picture. 

Aqo  A10  A20  A30 


/(A3) 

2.2.  Selecting  the  first  b  columns  from  (strong)  rank  revealing  QR  fac¬ 
torization.  The  (possibly  strong)  rank  revealing  QR  factorization  of  a  matrix  A^  is 
used  in  tournament  pivoting  at  each  node  of  the  reduction  tree  to  select  b  candidate 
columns.  Suppose  that  Atj  is  of  dimension  m\  x  n\.  When  the  matrix  fits  in  fast 
memory  on  a  sequential  processor,  the  b  column  candidates  may  be  selected  by  com¬ 
puting  the  first  b  steps  of  QR  with  column  pivoting.  When  a  strong  rank  revealing 
factorization  is  employed,  several  supplementary  operations  and  swaps  are  performed, 
as  explained  in  [20]. 

When  the  matrix  does  not  fit  in  fast  memory,  or  it  is  distributed  over  several  pro¬ 
cessors,  the  b  candidate  columns  are  selected  by  first  computing  the  QR  factorization 
of  Ay  without  pivoting,  and  then  the  (strong)  QR  factorization  of  the  much  smaller 
(2&-by-2&)  R  factor.  The  first  QR  factorization  is  performed  using  communication 
avoiding  QR  [11]  for  tall  and  skinny  matrices,  referred  to  as  TSQR,  which  minimizes 
communication. 

2.3.  Why  QR  with  tournament  pivoting  reveals  the  rank.  In  this  section, 
we  first  recall  the  characterization  of  a  (strong)  RRQR  factorization  from  [20],  modify 
it  slightly,  and  then  show  (with  appropriate  choice  of  bounds)  that  it  describes  the 
result  of  tournament  pivoting.  The  characterization  depends  on  the  particular  rank  k 
chosen.  Subsection  2.3.1  analyzes  the  case  k  =  6,  i.e.  the  result  of  a  single  tournament. 
Then  subsection  2.3.2  extends  the  analysis  to  any  1  <  k  <  min (m,n),  i.e.  the  final 
output  of  the  algorithm. 
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To  set  up  the  notation  needed  to  explain  [20],  let  0Ji(X)  denote  the  2-norm  of  i-th 
row  of  X _1,  and  let  7 j(X)  denote  the  2-nonn  of  the  j- th  column  of  X. 

Theorem  2.1.  (Gu  and  Eisenstat  [20])  Let  B  be  an  m  x  n  matrix  and  let 
1  <  k  <  min (m,  n).  For  any  given  parameter  f  >  1,  there  exists  a  permutation  II 
such  that 


BU  =  Q 


R 11  R12 

R22 


where  Rn  is  k  x  k  and 


(Rn'Ru)2^  +  ovf  (Rn)  l)  (R22)  <  f2- 


(2.1) 


The  factorization  in  Theorem  2.1  can  be  computed  in  Ofmnk)  flops.  Inequality  (2.1) 
is  important  because  it  implies  bounds  on  the  singular  values  of  Rn  and  R22' 

Theorem  2.2.  (Gu  and  Eisenstat  [20])  Let  the  factorization  in  Theorem  2.1 
satisfy  inequality  (2.1).  Then 


&i{B)  (J j  (R22) 
^i(-Rn)’  <?k+j{B) 


<  V1  +  f2k(n  -  k), 


for  any  1  <  i  <  k  and  1  <  j  <  min(rn,  n)  —  k. 

In  particular,  Theorem  2.2  shows  that  the  QR  factorization  in  Theorem  2.1  reveals 
the  rank  in  the  sense  that  the  singular  values  of  Rn  are  reasonable  approximations 
of  the  k  largest  singular  values  of  B ,  and  the  singular  values  of  R22  are  reasonable 
approximations  of  the  min(ro,  n)  —  k  smallest  singular  values  of  B.  We  call  this  a 
strong  rank  revealing  factorization  because  the  bound  in  Theorem  2.2  grows  like  a 
low  degree  polynomial  in  n.  In  contrast,  traditional  column  pivoting  only  guarantees 
that  /  =  0(2n).  Still,  traditional  column  pivoting  often  works  well  in  practice,  and 
we  will  use  it  as  a  tool  in  our  numerical  experiments  later. 

To  analyze  our  communication-avoiding  RRQR  algorithm  in  a  more  convenient 
way,  we  consider  the  following  relaxed  version  of  Theorem  2.1. 

Corollary  2.3.  Let  B  be  an  m  x  n  matrix  and  let  1  <  k  <  min(r?z,  n).  For  any 
given  parameter  f  >  1,  there  exists  a  permutation  II  such  that 


bii  =  q 


Rn  R12 
R22 


where  Rn  is  k  x  k  and 

yjij  (R11R12)  +  {lj  (-R22)/o-min(-Rn))2  <  fVk  for  j  =  !,-■■  ,n~  k.  (2.2) 


The  proof  is  immediate,  as  the  permutation  II  of  Theorem  2.1  automatically 
satisfies  inequality  (2.2).  Below  is  the  analogue  of  Theorem  2.2  based  on  Corollary  2.3. 

Theorem  2.4.  Assume  that  there  exists  a  permutation  II  for  which  the  QR 
factorization 


bh  =  q 


Rn  R12 
R22 


where  Rn  is  k  x  k,  satisfies 

(^12)  +  (7 j  {R22)  Mnint-Rii))2  <  F  for  j  =  1,  ■  ■  ■  ,n  —  k.  (2.3) 
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Then 


for  any  1  <  i  <  k  and  1  <  j  <  min (rn,  n)  —  k. 

Remark:  By  Corollary  2.3,  there  exists  a  permutation  II  for  which  inequality  (2.3) 
is  satisfied  with  F  =  f\fk.  For  this  permutation  and  with  this  choice  of  F,  Theo¬ 
rem  2.4  gives  the  same  singular  value  ratio  bounds  as  those  in  Theorem  2.2.  However, 
Theorem  2.4  will  prove  much  more  convenient  for  our  subsequent  analysis  and  could 
provide  a  better  practical  bound  on  the  singular  values  as  we  can  typically  expect  F 
to  be  much  smaller  than  fVk. 

Proof  of  Theorem  2.4:  Define 

_  0-max(f?22) 


i(-Rll) 


Then  we  can  rewrite 


BU  =  Q 


I  Rf}R 


11  ^12 
al 


It  follows  that 


I  Rf}R 


11  ^12 
Oil 


i(B)< 


On  the  other  hand,  the  largest  k  singular  values  of 

those  of  the  matrix  Rn,  and  the  2-norm  of  the  matrix 
above  by 


i=  ,  k. 


I  i?111I?i2 


are  precisely 


is  bounded 


'l  +  \\Rf1R12Wl  +  a2<  yj\  +  F2{n-k). 


In  other  words, 


CTi(Rll) 

for  ,  k.  Conversely,  observe  that 


<  y/l  +  F2  (n  —  k ) 


Rn  R\2 

R-22 


Otl  — R\1  R\2 
I 


where  the  smallest  min(m,  n)  —  k  singular  values  of 


values  of  R22,  and  the  2-norm  of  the  matrix 
bounded  as 


Oil  — R11  Rl2 
I 


are  the  singular 


can  be  similarly 


VI  +  \\Rf1R12Wl  +  a2  <  Vl  +F2(n-k). 

This  again  leads  to 

^  —  \/l  +  F2(n  —  k ),  j  =  1,  •  •  •  ,  min(m,  n)  —  k. 

ak+j\B) 

This  completes  the  proof.  □ 


2.3.1.  Analyzing  one  tournament  step.  Next  we  apply  Theorem  2.4  to  an¬ 
alyze  one  tournament  step:  Given  a  subset  of  6  columns  of  matrix  B  that  reveals  its 
rank  (for  k  =  b),  and  another  subset  of  b  columns  that  reveals  the  rank  of  a  different 
matrix  B ,  we  will  show  that  computing  a  rank-revealing  decomposition  of  just  these 
2b  columns  of  B  and  B  will  provide  b  columns  that  reveals  the  rank  of  the  combined 
matrix  [B,  B]  (see  Lemma  2.5).  Then  we  will  use  this  as  an  induction  step  to  analyze 
an  entire  tournament,  once  using  a  binary  tree  (Corollary  2.6)  and  once  with  a  flat 
tree  (Corollary  2.7). 

We  use  the  following  notation  to  denote  the  rank-revealing  factorizations  of  B 
and  B : 


bh  =  q 

Rll  R12 

and  BIl  =  Q 

R11  R12 

R22 

R22 

where  II  and  II  are  permutation  matrices;  Ru  and  Rn  are  6x6  upper-triangular 
matrices.  We  assume  that  the  factorizations  in  equation  (2.4)  satisfy 


7.j(^)2+7i(^22)2/CTmin(i?ii)2  <  F2 ,  ^  (N)2  +  ^  (R22)2  /  amin  ( R n)2  <  F2 ,  (2.5) 

where  N  =  R^Ri 2  and  N  =  R^R\2-  For  example,  by  Theorem  2.1,  we  can  choose 
F  =  F  =  f\fb  when  B  and  B  each  consists  of  26  columns;  later  on  we  will  use  this 
fact  as  the  basis  of  our  induction.  Next  we  develop  similar  bounds  for  the  combined 
matrix  B  =  B  B  . 

Tournament  pivoting  continues  by  computing  a  (strong)  RRQR  factorization  of 
the  form 


where 


Let 


BIl  d= 

Q 

Ru 

,  Q 

R11 

tl  =  Q 

Rll  R12 
R22 

(tfu1^).  .  +  u2  (-Rn)  7 j  (-R22)  <  /2- 


n  = 


n 


be  the  accumulation  of  all  permutations,  then  we  can  write  B  as 

R\2 


b  n  = 


0 


=  Q 


=fQ 


Rll  R12 
R22 

Rll  R12 
R22 

Ru  R12 
R22 


Q 


R 


22 


,  Q 


Q  Q 


Rl2 

R22 


R-12 

R22 

qtq 


R12 

R22 


(2.6) 


(2.7) 


(2.8) 
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Our  goal  is  to  derive  bounds  analogous  to  (2.5)  for  equation  (2.8).  To  this  end, 
we  need  to  first  identify  the  matrices  i?i2  and  Rii-  Note  that 


=  qtq 


RnN 

R22 


=  qtq 


N  +  Q  Q 


R22 


Continuing,  we  may  write 


RuM 

C 


where  the  b  x  b  matrices  A f  and  C  are  defined  as  follows:  for  each  1  <  t  <  b,  the 
f-th  column  of  the  matrix  on  the  left  hand  side  of  equation  (2.9)  must  be  some  s-th 

n  f  R11  R12  1  If  s  <  b,  then  we  set  the  t-th  column  of  C  to  be  0,  and 


column  of  Q 


the  t- th  column  of  N  to  be  all  0  except  the  s-th  entry,  which  will  be  1.  On  the  other 
hand,  if  s  >  b,  then  we  set  the  i-th  columns  of  C  and  AT  to  be  the  (s  —  &)-th  columns 
of  R22  and  R11R12,  respectively.  Since  /  >  1,  we  must  have  by  inequality  (2.7)  that 


A r?j+Wi  (i?n)27|(C)</2 


for  all  1  <  i,j  <  b.  With  the  additional  notation 


def  Cl 

C2 


we  can  further  rewrite 


R12  1  I"  RuAf 


N  +  Q  Q 


(A fN  +  R~11C1 
CN  +  C2 


Similarly,  define  matrices  A f,C,  C\  and  C2  so  that 


Rn  (A Tn  +  R^Ci 
CN  +  C2 


All  this  algebra  has  finally  allowed  us  to  identify 


(A fN  +  RiiC^j  Ru  (A 'fN  + 


i?i2  —  R12  Ru  (a fN  +  R-^iCi^ 

R22  —  R'22  CN  •  C2  CN  -(-  C2 


Below,  we  derive  bounds  analogous  to  (2.5)  for  equation  (2.8).  We  do  this  ac¬ 
cording  to  the  1x3  partition  in  R12  and  i?22-  By  inequality  (2.7),  we  have 

7j  (Ry 1R12)  +7f  {R22) 2  /alin  (Ru)  <  bf. 


In  addition,  we  have 


7,2  (A fN  +  fl^Ci)  +  72  (CN  +  C2)  /a2min  (fln)  (2.13) 

J\  _  1  [  1\ 

i  y  CN/ <Jmin  C2/ O'm.in  J 

-2(^(  CN/a^nfiu)  )+7j?(  C2/amin  (flu)  )) 

<  2  ^  c/amin  (flu)  +Tj  C2  /CTmin  (^n)  j  '  (2'14^ 

It  follows  from  inequality  (2.10)  that 

r  Af  1  2 

C/am in  (flu)  ^  -  f2b  ’ 

and  it  follows  from  definition  (2.11)  that  7 2  ^  ^  ^  =  7?  (fl22).  Furthermore, 

equation  (2.9)  can  be  rewritten  as 

flJifln  =AfTR[1R11Af  +  CTC, 

which  implies  that 

^min  (-Rll)  <  ^min  (-^ll)  ll-A/’H^  +  \\C\\2 

-  (*■■)  (  C/„J(fl..)  J  s  (*”)  • 


which  in  turn  leads  to 

V^min  (-Rll)  <  /2^/0-min  (#ll)  • 

Plugging  all  these  results  into  equation  (2.14),  we  obtain  an  upper  bound  on  (2.13): 

72  (AW  +  flu'Ci)  +  72  (CAT  +  C2)  /<£*  (flu)  <  2/V  (7?(1V)  +  72  (flaa)  /^in  (An)) 

<  2f2b2F2. 

Similarly,  we  can  derive  an  upper  bound 

72  (-AW  +  +  72  (CIV  +  C2)  /a2min  (flu)  <  2/Vi?2. 

All  these  results  are  now  summarized  in  the  following  lemma. 

Lemma  2.5.  Suppose  we  are  given  two  rank-revealing  QR  factorizations  of  B  and 
B  as  in  equation  (2.f),  that  reveal  their  ranks  as  described  in  equation  (2.5).  Suppose 
we  perform  another  rank- revealing  QR  factorization  of  the  b  selected  columns  of  B 
and  b  selected  columns  of  B,  as  described  in  equations  (2.6)  and  (2.7).  Then  the  b 
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columns  selected  by  this  last  QR  factorization  let  us  perform  a  QR  factorization  of 
B  =  [ B,B ]  as  described  in  equation  (2.8),  that  reveals  the  rank  of  B  with  the  bound 


+  7j2  (^22)  /77n  (#11)  <  V2fb  max  (V,  f)  . 


(2.15) 


We  may  use  Lemma  2.5  as  the  induction  step  to  analyze  the  result  of  any  tourna¬ 
ment,  with  any  reduction  tree.  We  consider  two  cases:  a  binary  reduction  tree,  and  a 
flat  reduction  tree,  both  applied  to  an  m  x  n  matrix  with  m  >  n. 

In  the  case  of  a  complete  binary  tree,  we  can  let  B  and  B  each  contain  2h  blocks 
of  b  columns  each,  where  1  <  h  <  log 2(n/b)  —  1.  Assume  that  both  B  and  B  satisfy 
relations  (2.5)  with  F  =  F  =  Fff  (the  superscript  B  stands  for  “binary  tree”).  By 
Corollary  2.3,  for  the  base  case  we  can  set  Ff3  =  f\fb.  Lemma  2.5  yields  the  following 
recursive  relation 


Wf+i)  <  V2fbF*. 


Solving  this  recursion  yields 

F*+1  ~  71  (^/5)  '  (2'16) 

For  h  =  log 2(n/b)  —  1,  i.e.  after  the  entire  tournament  has  completed,  we  get 

Corollary  2.6.  Selecting  b  columns  of  the  mxn  matrix  A  using  QR  with  tour¬ 
nament  pivoting,  with  a  binary  tree,  reveals  the  rank  of  A  in  the  sense  of  Theorem  2 .4 
for  k  =  b,  with  bound 


rpB  _ 

log2(n/h)  — 


log2(ra/fc) 


1  (n/&)loS2(C2/6) 

72  b 


(2.17) 


The  bound  in  (2.17)  can  be  regarded  as  a  low  degree  polynomial  in  n  in  general 
for  a  fixed  b  and  /.  Note  that  the  upper  bound  is  a  decreasing  function  of  b  when 

b  >  yJn/(V2f). 

In  the  case  of  a  flat  tree,  we  may  let  B  and  B  contain  h  blocks  and  1  block  of 
b  columns,  respectively,  where  1  <  h  <  n/b  —  1.  Assume  that  both  B  and  B  satisfy 
relations  (2.5)  with  F  =  F;f  and  F  =  Ff  =  0,  respectively  (the  superscript  F  stands 
for  “flat  tree”).  Lemma  2.5  now  yields  the  following  recursive  relation 


(F7+1)  <  72 fbF[. 


Solving  this  recursion  yields 

F^+1  ~  7 Tb  fb 

For  h  =  n/b  —  1,  i.e.  after  the  entire  tournament  has  completed,  we  get 

Corollary  2.7.  Selecting  b  columns  of  the  mxn  matrix  A  using  QR  with 
tournament  pivoting,  with  a  flat  tree,  reveals  the  rank  of  A  in  the  sense  of  Theorem  2 .4 
for  k  =  b,  with  bound 

I  /  \  n/b 

F" s  71 77  ■  (2-19) 


h+1 

(2.18) 


Bound  (2.19)  is  exponential  in  n,  pointing  to  the  possibility  that  the  QR  factor¬ 
ization  we  compute  might  not  quite  reveal  the  rank  in  extreme  cases.  But  note  that 
the  upper  bound  is  a  rapidly  decreasing  function  of  b. 
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Example:  6  =  1  (traditional  column  pivoting).  We  now  evaluate  bounds  (2.17) 
and  (2.19)  when  6=1.  It  is  easy  to  see  that  the  optimal  single  column  to  choose  to 
reveal  the  rank  is  the  one  with  the  largest  norm,  and  that  this  choice  satisfies  both 
Theorem  2.1  with  /  =  1  when  k  =  1,  and  Theorem  2.4  with  F  =  1,  both  of  which 
are  unimprovable.  Furthermore,  no  matter  what  kind  of  reduction  tree  we  use  for  the 
tournament,  the  column  with  the  largest  norm  will  be  chosen  by  the  tournament  (ties 
may  be  broken  in  different  ways),  since  we  always  pick  the  column  of  largest  norm 
at  each  step.  So  when  6  =  1,  tournament  pivoting  is  equivalent  to  classical  column 
pivoting,  whose  analysis  is  well  known.  Comparing  our  bounds  from  (2.17)  and  (2.19) 
to  the  optimal  value  of  F  =  1,  we  get  F^^  n  <  y Jn/2  and  F£  <  So  the 

analysis  of  the  binary  tree  is  reasonably  close  to  the  best  bound,  although  the  flat 
tree  analysis  gives  a  weaker  bound. 

2.3.2.  Extending  the  analysis  to  a  full  rank- revealing  QR  factorization. 

In  this  section  we  perform  a  global  analysis  to  extend  the  one-step  analysis  of  Sec¬ 
tion  2.3.1  to  a  full  rank-revealing  QR  factorization. 

Corollaries  2.6  and  2.7  in  the  last  section  showed  that  applying  tournament  piv¬ 
oting  once,  to  pick  the  leading  6  columns  of  a  matrix,  provide  a  column  permuta¬ 
tion  that  reveals  the  rank  in  the  sense  of  Theorem  2.4,  but  just  for  one  partition  of 
R  R 

R  =  **  12  ,  namely  when  Ru  is  6  x  6.  The  goal  of  this  section  is  to  show 

U  R22 

that  using  repeated  tournaments  to  select  subsequent  groups  of  6  columns  can  make 
R  rank  revealing  for  Ru  of  any  dimension  1  <  k  <  n. 

Since  the  algorithm  described  in  the  last  section  only  chooses  groups  of  6  columns 
at  a  time,  it  guarantees  nothing  about  the  order  of  columns  within  a  group.  For 
example  the  algorithm  might  do  nothing  when  given  6  or  fewer  columns.  So  to  prove 
a  rank  revealing  property  for  any  k ,  not  just  k  =  tb ,  we  must  apply  a  second  (lower 
cost)  step  where  we  apply  strong  RRQR  to  each  6x6  diagonal  block  of  R. 

Given  this  second  step,  we  will  derive  an  upper  bound  in  the  form  of  (2.3).  Our 
approach  is  to  reveal  the  “least”  rank-revealed  matrix  given  the  one-step  results  in 
Section  2.3.1. 

To  this  end,  we  first  introduce  some  new  notation.  Let  e  be  the  vector  of  all  l’s 
of  various  dimensions.  For  every  matrix  X,  let  |X|  be  the  matrix  with  each  entry 
the  corresponding  entry  of  X  in  absolute  value.  Let  N  and  Ft  be  two  matrices  of 
compatible  size,  by  the  relationship 

N  X  H, 

we  mean  that  H  —  N  is  a  non-negative  matrix.  Our  global  analysis  will  benefit  from 
the  following  lemma,  the  proof  of  which  we  omit. 

Lemma  2.8.  Let  N  and  H  be  upper  triangular  matrices  with  unit  diagonal,  and 
with  non-positive  entries  in  the  upper  triangular  part  of  H .  Assume  that 

\N\±\H\. 

Then  we  have 

|fV_1|  ^  \H~1\. 
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Now  let 


N  = 


I  N12 
I 


Nlt 

N2t 

I 


(2.20) 


where  each  Ntj  is  a  6x6  matrix  with  1  <  i  <  j  <  t;  and  'Jk(Nij)  <  for  1  < 
k  <  6.  Our  global  analysis  will  critically  depend  on  tight  estimates  of  || iV—  1 1| 2 ■  As  a 
surrogate,  we  will  instead  discuss  the  choices  of  all  the  Nj:.j  submatrices  to  maximize 
||7V_1||i.  To  start,  define 


H  = 


I  —  |JV12| 

I 


~\Nu\ 

-\N2t\ 


I 


In  other  words,  we  construct  H  by  flipping  the  signs  of  all  positive  off-diagonal  entries 
of  N.  It  follows  that  \N\  =  \H\,  and  from  Lemma  2.8  that  |iV—  1 1  +  which 

implies  that  ||./V_1||1  <  ||IL_1||1.  Define 


I  M12  •  •  •  Mu 

I  •  •  •  M2t 


I 


(2.21) 


where  Mtj  = - beeT.  It  is  easy  to  verify  that  7 (.(M^)  =  c,;  for  all  1  <  k  <  b  and 

y/b 

hence  M  satisfies  the  conditions  on  the  matrix  N  in  equation  (2.20).  Straightforward 
algebra  shows  that  1 1| !  =  IljZi  +  Vbc^j .  Lemma  2.9  below  identifies  M  as 
the  matrix  that  maximizes  || 1 1|  1  - 

Lemma  2.9.  Let  matrices  N  and  M  be  defined  by  equations  (2.20)  and  (2.21), 
respectively.  Then 


||iV-1||i<||M-1||1. 


Proof:  We  only  consider  the  matrix  N  for  which  the  upper  triangular  entries 
are  all  non-positive,  so  that  |-A-1|  =  iV-1.  We  prove  Lemma  2.9  by  induction  on  t. 
The  lemma  is  obviously  true  for  t  =  1.  For  t  >  1,  we  will  show  the  sum  of  entries 
in  every  column  of  TV-1  is  bounded  above  by  ||M-1||i.  Let  N  and  M  be  the  first 
(t  —  1)  x  (t  —  1)  block  submatrices  of  N  and  M,  respectively.  By  induction,  for  any 
1  <  k  <  (t  —  l)b,  the  sum  of  entries  in  the  fc-th  column  of  TV-1  is  bounded  above  by 
||A/-1||i  <  ||M-1||i.  For  any  (t  —  1)6+  1  <  k  <  tb,  define 


2/i 

X\ 

and  x  = 

.  2/i— 1 

.  xt- 1 
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where  yt  is  the  (k  —  (t  —  l)6)-th  column  of  Nn,  with  \\yi\\2  <  Cj.  By  definition,  y  and 
x  are  the  fc-th  columns  of  N  and  IV-1  without  the  bottom  b  components,  hence  the 
sum  of  entries  in  the  k- th  column  of  N~x  is  simply 

1  +  eTx  =  1  +  eTXj . 
j= 1 


Since  x\  =  —y\  +  J2j= 2  (— ^i,j)  Xj,  it  follows  from  the  Cauchy- Schwartz  inequality 
that 


t-i 


t- 1 


t- 1 


eTX\  =  —eTyi  +  eT  (—Nij)  Xj  <  Vbci  +  VbcieTXj  =  \fbc\  1  + 


T 

e  x 


3  > 


3= 2 

and  that 

1  T  eT x  ft  1  -j-  Vfeci  4-  (1  +  ^ci)  5Z 

2 


j=2 


i=2 


t-i 


T 

e  Xj  = 


(l  +  i/fici 


t-i 

1  +  ^  o  U.J 

3= 2 


T 

e  aq 


To  continue  with  the  same  argument  to  the  end,  we  have 


t—  1 

+  eTx  <  ]^[  (l  +  Vbc3)  =  ||M_1||i.n 


3= 1 


We  are  now  ready  to  derive  global  upper 
of  tournament  pivoting  have  been  computed: 

bounds. 

Suppose  that  at  least 

Ru  R12 

R22  ■  ■  ■ 

Rit 

R2t 

-Ri.t+i 

R-2.1  .  1 

BU  =  Q 

Rtt 

Rt,t+ 1 
Rt+i,t+i 

5 

where  Ru  is  b  x  b  for  1  <  i  <  t,  and  Rt+i,t+\  is  (n  —  tb)  x  (n  —  tfo),  i.e.  the  first  t  tour¬ 
naments  selected  the  tb  columns  yielding  Rn  through  Rtt.  In  light  of  equation  (2.3), 
we  need  to  develop  an  upper  bound  on 


( 

-R11  R\i 

R\  t 

-1 

•Ri,t+i 

\ 

2  def  2 

R22  ■ 

R‘2t 

R-2,t+l 

r  = 

\ 

Rtt 

Rt,t+ 1 

/ 

+7j  (Rt+i,t+i)  /(?  min 


f 

R 11  R 12 

R22  ■ 

Rit 

R2t 

\ 

V 

Rtt 

/ 

Let  N  be  the  matrix  defined  in  equation  (2.20)  with  Ni3  =  Ru  1  R,tj .  Then  N  satisfies 
'tk(Nij)  <  Ci,  where  the  exact  value  of  Cj  depends  on  the  reduction  tree: 


ufg2((n-(i-i)6)/6)  for  binaiT  tree  (see  Corollary  2.6) 


c,;  = 


F‘ 


-(i-i)b 


for  flat  tree  (see  Corollary  2.7). 
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’  2/i  ’ 

R\\Ri,t+i 

Let  y  = 

be  the  j-th  column  of  the  matrix 

.  Vt  . 

.  Ru1Rt,t+ 1  . 

.  We  rewrite  r 2  as 


T  =  \\N  y\\2  +  7 j  (Rt+i,t+i)  /<? mi„  (diag  (i?n,  •  •  •  ,  i?tt)  N) . 

By  Corollaries  2.6  and  2.7,  1 1 1 1 2  is  bounded  above  by  the  same  c*  as  defined 
above.  Repeating  the  same  arguments  made  in  the  proof  for  Lemma  2.9,  it  is  easy  to 
show  that 


<  llJV-^Hr  <  \\M~ 


Vbe 


Vbe 


l  l 

J^[  (l  +  Vbcj'j  —  1  <  JJ  ^1  +  Vbe 


3= 1 


3= 1 


For  the  second  term  in  r2,  we  note  by  Corollaries  2.6  and  2.7  that 

7 j  (Rt+i,t+i)  /er  min  (Rii')  ^  Cj  7:  Cl 

for  all  1  <  i  <  t.  This  implies  that 

lj  (Rt+i,t+i)  /crmin  (diag  (i?n,  •  •  •  ,Rtt))  <  ci- 

Hence 

7j  (Rt+i,t+i)  /o-min  (diag  (i?n,  •  •  •  ,Rtt)N) 

<  7 j  (i2t+i,t+i)  /^min  (diag  (i?n,  ■  •  •  ,#«))  ||iV— 1 1|2 


< 


Civ^||iV  1 1|  1  <  CiVrc||M  1||i  =  \fnci  ]^[  (l  +  Vbe 

Putting  it  all  together, 


3= 1 


t—i 


T2  < 


(l  +  VbcjJ  +  Vnci  ^1  +  Vbe 


U  =  1 


3=1 


< 


^1  +  Vbct'j  +nc\  )  (n(*+A»)) 

2  /  t—i 

^1  +  Vb  +  nci  )  n(i+v/kb) 


To  further  simplify  the  upper  bound  on  r2,  we  consider  the  two  special  cases  of 
the  reduction  tree: 

Binary  tree:  In  this  case,  we  simply  use  the  relation 

1  /  nr  ..\log2 ("/*>) 

Cj  <  Cl  = 


V2b 


2  fb 


so  that 


T2  < 


log  2(n/b)\2n/b 


/  / -  \2  /  r  \2n/b  /  -  \2  /  \log2(n/6)\ 

(^1  +  Vb  +  ncij  ^1  +  vbcij  =  (l  +  Vb  +  ncij  (  1  +  -j=  [V2  fbj  J 
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Flat  tree:  In  this  case,  we  note  that 


T2  < 


(l  +  v^ 


nci 


n 

U=1 


2 

c,-  I  e  3=1  ^ 


(l  +  Vb 


+  nc\  b 


ut- 1 


T-i 

n« 

U'  =  l 


2  Y^t-1 

,  Vb  Z-j  =  \  Cj 


Since 

53  -  =  ^26  Z]  (y^fb)  ~{n/b~:)  <  V2b/  (l  -  l/(v/2/6))  <  4v/6, 

j=i  Ci  t=i 


it  follows  that 


We  are  now  ready  to  state  the  main  theorem  of  this  section. 

Theorem  2.10.  Let  A  be  m  x  n  with  m  >  n,  and  assume  for  simplicity  that  b 
divides  n.  Suppose  that  we  do  QR  factorization  with  tournament  pivoting  on  A  n/b 
times,  each  time  selecting  b  columns  to  pivot  to  the  left  of  the  remaining  matrix.  Then 
for  ranks  k  that  are  multiples  of  b,  this  yields  a  rank-revealing  QR  factorization  of  A 
in  the  sense  of  Theorem  2-4,  with 


(l  +  y/b  +  ?rci)  +  ^75  (V%fb)  g2*  ^  ^  ,  for  binary  tree 

e4  (l  +  \/b  +  nci)  (l/V2)n/b  (y/2 fffjn^b('n^b+1^2  ^  for  fiat  tree. 


Although  the  binary  tree  bound  is  relatively  smaller,  we  must  note  that  both 
bounds  in  Theorem  2.10  are  super-exponentially  large.  In  contrast,  our  numerical 
results  in  section  4  are  much  better.  As  with  conventional  column  pivoting  (the 
special  case  6=1),  the  worst  case  must  be  exponential,  but  whether  these  bounds 
can  be  significantly  tightened  is  an  open  question.  The  matrix  M  in  equation  (2.21) 
was  constructed  to  have  the  largest  1-norm  in  the  inverse  among  all  matrices  that 
satisfy  equation  (2.20),  which  is  likely  to  be  a  much  larger  class  of  matrices  than 
CA-RRQR  actually  produces. 

Example:  6  =  1  (traditional  column  pivoting).  We  now  evaluate  the  bound  in 
Theorem  2.10  when  6  =  1,  i.e.  traditional  column  pivoting.  In  this  case  we  know  the 
best  bound  is  0( 2n),  which  is  attained  by  the  upper  triangular  Kahan  matrix,  where 
Kj,  =  c1-1  and  Kjj  =  —  sc1^1  where  j  >  i,  s2  +  c2  =  1  and  s  and  c  are  positive.  In 
contrast,  our  bounds  are  much  larger,  at  0(n”/2)  and  0(2”2/4),  respectively. 

Finally,  we  consider  the  case  of  k  not  a  multiple  of  6.  We  only  sketch  the  derivation 
of  the  bounds,  since  explicit  expressions  are  complicated,  and  not  more  illuminating. 
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Recall  that  we  must  assume  that  strong  RRQR  has  been  independently  performed  on 
the  6  x  6  diagonal  block 


Riill 

0 


Riil2 

Rii22 


of  the  final  result  of  tournament  pivoting,  with  analogous  notation  R,:j  -|  and  Rij2, 
with  j  >  i,  for  the  subblocks  of  Rij  to  the  right  of  Ran  and  Ru 22,  resp.,  and  Rkii 
and  Rki2i  with  k  <  i,  for  the  subblocks  of  R^i  above  Ran  and  Ru  12,  resp.  yielding 


We  first  need  to  bound  the  2-norm  of  each  column  of  the  matrix 

Ru  R\  2  R\  3  Ru 

0  Ran  Rui2  R24 

_  R^R-13  —  R-11  Rl2RiiiiRiil2  R\\  R\4  ~  Ry^  R\2Rjy\\  R-24 

RiillRiil2  RiillRu 

If  we  can  bound  the  2-norm  of  every  column  of  all  4  subblocks  in  the  above  expression, 
then  their  root-sum-of-squares  will  bound  the  overall  column  2-norms: 

•  column  norms  of  f?111Ri2,  R^\  R\ 3 ,  and  R^Rn  are  all  bounded  by  our  pre¬ 
vious  result  on  tournament  pivoting  alone,  since  R\\  is  (i  —  1)6  x  (*  —  1)6. 

•  Ryjyi i Ravi  is  bounded  since  we  did  strong  RRQR  on  Ru. 

•  To  see  that  columns  of  R~aiR 24  are  bounded  we  note  that  columns  of  R~^Rij 
are  bounded  (for  j  >  i)  by  our  previous  result  for  tournament  pivoting  alone. 
Performing  strong  RRQR  on  Ru  changes  Ru  to  QRuTl  and  R.tj  to  QRij  for 
some  permutation  II  and  orthogonal  Q,  so  columns  of 


R,, '  R,j  =  n  •  (QRuU)  1(QRij) 

—  n  Ru  12  R24 

0  Ru 22  J  [  R34 

_  1 1  RiillRu  RuilRiil2Ru22-R-34: 

Rii22R-34 
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are  bounded  by  our  previous  result,  and  so  for  each  j 


7j  +  7 j  {^-iill^-24  ^4411-^**12-^4422-^34^  +7 j  (-^4411-^**12-^4422-^34^ 

—  7 j  {R%i  Rij)  +  \\RiillRiii2W2  ’  7?  (-^ii22-^34^ 

—  7j  (-^44  ^*j)  +  11^411^**12112  ’  7 j  ( R%i  Rij ) 

<  (1  + 1 1 -^uii -^**12 II  2)7.7  {Ru1  Rij) 

is  bounded. 

•  Finally,  we  can  combine  these  bounds  to  bound  above 

7 j  (-^11  Riz  ~~  R\\  -^12-^4411-^4412^  +  7 j  (-^11  ^13^+II-Rh  -R12 1| 2 "7y  (-^4411-^**12) 
and 

7 j  {R\\Ri4  ~  R-11  -^12-^4411-^24)  <  7 j  (-Rn1-Ri4)+||-Rn1-Ri2||2-7i  (-^4411-^24) 

Now  we  must  bound  from  above 


7 3 


Rii'22  R34 

0  R44 


/^min 


-R11  -R12  \ 

0  Run  J 


Using  the  previously  established  bounds  on  column  norms  of  i?111i?i2  and  Rii22R34i 
it  is  as  before  enough  to  bound  above 


7 j 


=  7 3 


Ri  422 

0 


0 

i?44 

R'ii‘22  0 

0  It]  1 


/+iniTi 

/  min  (cr. 


Ru 

0 


0 

-Riiii 


1 )  -^^1 1 )  5  'Urnin  (Rill  l  )  ) 


By  the  strong  rank- revealing  property  for  k  =  (i  —  1)6,  we  know  that  <7min(-Rn) 
cannot  be  much  smaller  than  7 j  (Run),  so  the  denominator  in  the  last  fraction  can 
be  replaced  by  <Jmin(Riiii) ■  Similarly,  the  numerator  can  be  replaced  by  7-,-  (Ru22)- 
This  leaves  us  with  jj  (^4*22)  /  &min(Riai) ,  which  is  bounded  above  by  the  strong 
rank- revealing  property  of  Ru. 

3.  Performance  analysis  of  QR  with  tournament  pivoting.  In  this  section 
we  analyze  the  performance  of  parallel  communication  avoiding  QR  with  tournament 
pivoting  and  we  show  that  it  minimizes  communication,  at  the  cost  of  roughly  tripling 
the  number  of  arithmetic  operations.  Algorithm  1  presents  the  parallel  CARRQR 
factorization  of  a  matrix  A  of  size  in  x  n.  The  matrix  is  distributed  block  cyclically 
on  a  grid  of  P  =  Pr  x  Pc  processors  using  blocks  of  size  6.  For  simplicity  we  suppose 
n/b  is  an  integer.  Consider  step  j  of  the  factorization,  and  let  mb  =  m  —  ( j  —  1)6, 
nb  =  n —  (j  — 1)6.  The  size  of  the  matrix  on  which  the  algorithm  operates  at  this  step  is 
mb  x  nb .  First,  tournament  pivoting  is  used  to  identify  6  pivot  columns,  using  a  binary 
tree  of  depth  log2  Hb/br-  For  the  ease  of  the  analysis,  we  consider  that  tournament 
pivoting  is  performed  as  an  all-reduce  operation,  that  is  all  processors  participate  at 
all  the  stages  of  the  reduction,  and  the  final  result  is  available  on  all  processors. 

Depending  on  the  size  of  6t,  a  processor  can  be  involved  in  operations  at  more 
than  one  node  at  each  level  of  the  binary  tree.  At  the  leaves,  the  6  candidate  columns 
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are  selected  from  blocks  of  size  m.b  x  bx-  For  this,  the  processors  in  the  same  process 
column  perform  the  QR  factorization  of  their  blocks  of  columns,  using  TSQR  [11]. 
The  R  factors  obtained  from  this  factorization  are  available  on  all  processors.  Each 
processor  selects  b  column  candidates  by  performing  QR  with  column  pivoting  on 
their  R  factor. 

After  this  step,  there  are  nb/bx  candidate  columns.  The  subsequent  steps  of  the 
reduction  operate  on  blocks  of  2b  columns,  where  the  new  column  candidates  are 
chosen  similarly  by  computing  TSQR  followed  by  QR  with  column  pivoting  on  the  R 
factor.  In  the  first  log2  (n/ ( bPc ))  levels  of  the  reduction  tree,  a  processor  is  involved  in 
the  computation  of  several  nodes  per  level.  In  other  words,  it  is  the  owner  of  several 
sets  of  candidate  columns.  Hence  it  can  form  with  no  communication  the  matrix  of 
2b  column  candidates  of  the  next  level.  At  the  last  log2  Pc  levels  of  the  reduction 
tree,  a  processor  is  involved  in  the  computation  performed  at  only  one  node  per  level. 
Hence  the  processors  in  the  same  process  row  need  to  exchange  their  local  parts  of 
the  candidate  columns. 


Algorithm  1  Parallel  QR  with  tournament  pivoting 
1:  Input  matrix  A  of  size  m  x  n,  block  size  b ,  bx 
2:  for  j  =  1  to  n/b  do 
3:  Let  mb  =  m  —  ( j  —  1)6,  rib  =  n  —  ( j  —  1)6 

4:  /*  Perforin  tournament  pivoting  to  choose  6  pivot  columns  */ 

5:  for  k  =  1  to  rib/(2bPc)  do 

6:  Let  A0fc  be  the  fc-th  column  block  of  size  mb  x  26  that  belongs  to  the  process 

column  of  my  processor. 

7:  Processors  in  the  same  process  column  select  6  column  candidates  by  per¬ 

forming  TSQR  of  their  block  Aofc  followed  by  strong  RRQR  of  the  R  factor. 

8:  end  for 

9:  for  i  =  1  to  log^inb/bx)  do 

10:  for  each  node  k  =  1  to  rib/{2bPc2l)  in  the  current  level  assigned  to  my 

processor  do 

11:  if  at  most  one  node  is  assigned  per  processor  then 

12:  Each  two  processors  in  the  same  process  row  exchange  their  local  parts 

of  the  6  column  candidates  selected  at  the  sons  of  their  current  node  k. 

13:  end  if 

14:  Let  Aik  be  the  matrix  obtained  by  putting  next  to  each  other  the  two  sets 

of  6  column  candidates  selected  at  the  sons  of  the  current  node  k. 

15:  Processors  in  the  same  process  column  select  6  column  candidates  by  per¬ 

forming  TSQR  of  their  block  A^  followed  by  strong  RRQR  of  the  R  factor. 

16:  end  for 

17:  end  for 

18:  Swap  the  6  pivot  columns  to  j-th  column  block:  A(:,jo  :  end )  =  A(:,jo  :  end)Tlj, 

where  jo  =  (j— 1)6+1  and  H,  is  the  permutation  matrix  returned  by  tournament 
pivoting. 

19:  TSQR  factorization  of  the  j-th  column  block  A  (jo  :  end,  jo  :  j  i)  =  QjRj,  where 

j i  =  jb,  Rj  is  an  upper  triangular  matrix. 

20:  Update  of  the  trailing  matrix  A(jo  :  end,j\  +  1  :  end)  =  Qj A(jo  :  end,j\  +  1  : 

end). 

21:  end  for 
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To  study  the  theoretical  performance  of  CARRQR,  we  use  a  simple  model  to 
describe  a  machine  architecture  in  terms  of  speed,  network  latency,  and  bandwidth. 
The  time  needed  to  send  a  message  of  m  words  between  two  processors  is  estimated  to 
be  cc  +  to/3,  where  a  specifies  the  latency  and  /3  the  inverse  of  the  bandwidth.  The  cost 
of  performing  parallel  CARRQR  of  an  to  x  n  matrix  is  given  in  Table  3.1.  We  display 
separately  the  cost  of  performing  communication  avoiding  QR  factorization  and  the 
cost  of  performing  tournament  pivoting,  with  bp  =  2 b  and  bp  =  n/Pc.  The  total  cost 
of  CARRQR  is  the  cost  of  performing  CAQR  plus  the  cost  of  performing  tournament 
pivoting.  The  parallel  CARRQR  factorization  based  on  tournament  pivoting  with 
bp  =  2 b  requires  three  times  more  flops  than  the  QR  factorization.  However  for 
bp  =  n/Pc,  the  parallel  CARRQR  factorization  performs  a  factor  of  n/{bPc )  more 
flops  than  the  QR  factorization.  Hence  the  former  choice  of  bp  =  2&  should  lead  to  a 
faster  algorithm  in  practice. 


Parallel  CAQR 

#  flops 

2mn2— 2na/3  ,  bn2  ,  3bn(2m  —  n)  ,  f  4b2 n  ,  n2(3b+5)'\  d 

P  1  2 Pc  1  2 Pr  \  3  2Pc  )  i0°2  ^ r  °  U 

ff  words 

(  £  +  ¥ )  l°g2  Pr  +  (  +  2 n)  log2  Pc 

H  messages 

%  log2  Pr+  f  log2  Pc 

Tournament  pivoting  with  br  =  2 b  (excluding  cost  of  QR  factorization) 

H  flops 

W-4n*/3  +  *^(l0g2  Pr  +  2)  +  f(m,  U,  b ,  Pr ,  Pc) 

H  words 

£  log2  Pr  +  (log2  Pc  +  1) 

H  messages 

26”p„  log2  Pr  +  "  l0g2  Pc{ l0g2  Pr  +  1) 

Tournament  pivoting  with  br  =  n/Pc  (excluding  cost  of  QR  factorization) 

H  flops 

2  rnuf-nV 4  _n_  +  1  ^  _n_  Pr  +  2)  +  f(m,  71,  b,  Pr ,  Pc) 

H  words 

6  Pr  bPr  l0S2  Pr+  Pr  '  T  l)  T  27l6  l0g2  Pr  l0g2  PC 

#  messages 

J  (log2  Pr  +  2  log2  Pc  +  log2  Pr  log2  Pc) 

Table  3.1 


Performance  models  of  parallel  CAQR  and  tournament  pivoting  when  factoring  an  in  X  n 
matrix,  distributed  in  a  2-D  block  cyclic  layout  on  a  Pr  X  Pc  grid  of  processors  with  square  b  X  b 
blocks.  All  terms  are  counted  along  the  critical  path.  We  generally  assume  m  >  n.  The  cost 
of  CARRQR  is  equal  to  the  cost  of  tournament  pivoting  plus  the  cost  of  CAQR.  In  the  table, 
f(m,  n,  b,  Pr,Pc)  =  4(2mp~"2b)  log2  Pc  +  ^  log2  Pr  log2  Pc  +  11  nb2  log2  Pc. 


We  recall  now  briefly  the  lower  bounds  on  communication  from  [3],  which  apply 
to  QR  under  certain  hypotheses.  On  a  sequential  machine  with  fast  memory  of  size 
M  and  slow  memory,  a  lower  bound  on  the  volume  of  data  and  on  the  number  of 
messages  transferred  between  fast  and  slow  memory  during  the  QR  factorization  of  a 
matrix  of  size  m  x  n  is 

(in n ^ \  (  77? n,  \ 

y==J  ,  #  messages  >  Q  •  (3-1) 

When  the  matrix  is  distributed  over  P  processors,  the  size  of  the  memory  per 
processor  is  on  the  order  of  0(mn/P),  and  the  work  is  balanced  among  the  P  pro¬ 
cessors,  then  a  lower  bound  on  the  volume  of  data  and  the  number  of  messages  that 
at  least  one  of  the  processors  must  transfer  is 


#  words  >  fi 


#  messages  >  Q 


(3.2) 
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To  minimize  communication  in  parallel  CARRQR,  we  use  an  optimal  layout,  with 
the  same  values  as  in  [10], 


P,  = 


and  b  =  B  ■ 


(3-3) 


The  number  of  flops  required  by  tournament  pivoting  with  bx  =  2 b,  which  does  not 
include  the  number  of  flops  required  for  computing  the  QR  factorization  once  the 
pivot  columns  have  been  identified,  is 


„  ,,  4mn2  —  4n3/3 
#flops=  - - - b 

+  ~]5~  (8 B  Q  lo§2  pr  +  log2  Pc  +  0  +  32 B2  Q  log2  Pr  log2  Pc  +  i  log2  Pc^j  ^ 


By  choosing  B  —  8_1log^1  (P^log^1  (Pc),  the  first  term  in  # flops  dominates  the 
redundant  computation  due  to  tournament  pivoting.  This  is  a  term  we  cannot  de¬ 
crease.  After  dropping  some  of  the  lower  order  terms,  the  counts  of  the  entire  parallel 
CARRQR  factorization  become 


#flops 

#vuords 

# messages 


6mn2  —  6n3/3  2 

— b  cmn  ,  c  <  1, 


P 


V 


W 


log2 


mP 


log2 


m  I 


,  z  ,mP,  2  nP 

—  loS2  \  - !og2  \  — , 


and  this  shows  that  parallel  CARRQR  is  communication  optimal,  modulo  polyloga- 
rithmic  factors. 

Appendix  B  describes  briefly  the  performance  model  of  a  sequential  communica¬ 
tion  avoiding  QR.  It  shows  that  sequential  CARRQR  performs  three  times  more  flops 
than  QRCP,  transfers  0(mn2 /M1/2)  number  of  words  and  exchanges  @(mn2/M3/2) 
number  of  messages  between  the  slow  memory  and  the  fast  memory  of  size  M  of  the 
sequential  computer.  Thus  sequential  QR  attains  the  lower  bounds  on  communica¬ 
tion.  As  explained  in  the  introduction,  neither  sequential  QRCP  nor  parallel  QRCP 
are  communication  optimal.  We  note  however  that  parallel  QRCP  minimizes  the 
volume  of  communication,  but  not  the  number  of  messages. 

4.  Experimental  results.  In  this  section  we  discuss  the  numerical  accuracy  of 
our  CARRQR  and  compare  it  with  the  classic  QR  factorization  with  column  pivoting 
(QRCP)  and  the  singular  value  decomposition.  We  focus  in  particular  on  the  value 
of  the  elements  obtained  on  the  diagonal  of  the  upper  triangular  factor  R  (referred 
to  as  R-values)  of  the  factorization  All  =  QR,  and  compare  them  with  the  singular 
values  of  the  input  matrix.  For  the  CARRQR  factorization,  we  test  binary-tree- 
based  CARRQR  and  flat-tree-based  CARRQR,  denoted  in  the  figures  as  CARRQR-B 
and  CARRQR-F  respectively.  The  singular  values  are  computed  by  first  performing 
a  QR  factorization  with  column  pivoting  of  A,  and  then  computing  the  singular 
values  of  the  upper  triangular  factor  R  with  the  highly  accurate  routine  DGESVJ 
[13,  14].  We  use  this  approach  to  compute  the  singular  values  since  given  a  matrix 
A  =  BY  (or  A  =  YD),  where  D  is  diagonal  and  Y  is  reasonably  well  conditioned, 
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the  Jacobi  SVD  algorithm  in  DGESVJ  computes  the  singular  values  with  relative 
error  of  at  most  0{e)n(Y ),  whereas  the  SVD  algorithm  based  on  the  QR  iteration 
has  a  relative  error  of  at  most  0(e)«(A).  Here  k(A)  is  the  condition  number  of  A, 
k(A )  =  ||A||2  •  ||H^1||9.  This  choice  affects  in  particular  the  accuracy  of  the  small 
singular  values.  Not  all  our  input  matrices  A  have  the  scaling  property  that  A  can  be 
expressed  as  the  multiplication  of  a  diagonal  matrix  and  a  well-conditioned  matrix,  but 
after  computing  All  =  QR  by  using  a  rank  revealing  QR  factorization  and  R  =  DY , 
we  expect  to  obtain  a  well  conditioned  Y.  Even  when  Y  is  not  well  conditioned,  it  is 
still  expected  that  DGESVJ  of  R  is  more  accurate  than  SVD  of  R. 

In  the  plots  displaying  R.-values  and  singular  values,  we  also  display  bounds  for 
trustworthiness  computed  as, 


emin{||(Hn0)(:,z)||2,||(Hn1)(:,i)||2,||(An2)(:,Q||2}  (4.1) 

emax{||(Hn0)(:,z)||2,||(Hn1)(:,i)||2,||(An2)(:,i)||2}  (4.2) 


where  Aj(j  =  0, 1,  2)  are  the  permutation  matrices  obtained  by  QRCP,  CARRQR-B, 
and  CARRQR-F  (represented  by  the  subscripts  j  =  0,1,2  respectively),  (An)(:,i)  is 
the  i-th  column  of  the  permuted  matrix  An,  and  e  is  the  machine  precision.  Since  the 
algorithm  can  be  interpreted  as  applying  orthogonal  transformations  to  each  column 
separately,  machine  epsilon  times  the  norm  of  column  i  is  a  reasonable  estimate  of  the 
uncertainty  in  any  entry  in  that  column  of  R,  including  the  diagonal,  our  estimate  of 
(jj.  Therefore  the  quantities  in  (4.1)  and  (4.2)  describe  the  range  of  uncertainties  in 
tJj,  for  all  three  choices  of  column  i ,  from  the  three  pivoting  schemes  tested. 

For  each  matrix  in  our  test  set  we  display  the  ratio  R(i,  i) /cr, ,  where  R  is  the  upper 
triangular  factor  obtained  by  QRCP,  CARRQR-B  or  CARRQR-F,  R(i,  i)  denotes  the 
*-th  diagonal  element  of  R,  assumed  to  be  nonnegative,  and  cr,  is  the  *-th  singular 
value  computed  as  described  above.  Consider  the  factorization  An  =  QR  obtained 
by  using  QRCP,  CARRQR-B,  or  C'ARRQR-F.  Let  D  =  diag(diag(R))  and  define 
Y  by  R  =  DYT .  Then  we  have  An  =  QDYT .  The  R- values  of  QRCP  decrease 
monotonically,  and  it  follows  straightforwardly  from  the  minimax  theorem  that  the 
ratio  R(i,i)/cri  can  be  bounded  as 


1  R(i,i) 

M  - 


<r-1ll, 


(4.3) 


and  these  lower  and  upper  bounds  are  also  displayed  in  the  corresponding  plots. 

As  described  in  [26],  if  after  a  first  factorization  An  =  QR,  a  second  QR  factor¬ 
ization  of  RT  is  performed,  i?Tni  =  Q\LT ,  then  the  diagonal  elements  of  the  lower 
triangular  factor  L  approximate  more  accurately  the  singular  values  of  the  input  ma¬ 
trix.  The  obtained  factorization  A  =  QA\LQ^At  is  called  a  QLP  factorization,  a 
special  case  of  ULV  decomposition,  where  Q  and  Q\  are  orthogonal  matrices  that 
approximate  the  left  and  right  singular  subspaces.  Stewart  notes  that  there  is  no 
need  to  perform  pivoting  in  the  second  QR  factorization,  and  hence  a  communica¬ 
tion  optimal  (modulo  polylogarithnic  factors)  QLP  factorization  can  be  obtained  by 
using  QR  with  tournament  pivoting  for  the  first  factorization,  and  then  CAQR  for 
the  second  factorization.  In  our  numerical  tests  we  use  the  QLP  factorization  only 
for  two  matrices,  the  devil’s  stairs  and  Kahan  matrix,  two  challenging  matrices  for 
rank-revealing  factorizations  also  used  in  [26].  For  the  other  matrices,  we  test  QRCP, 
CARRQR-B,  and  CARRQR-F.  The  entire  set  of  matrices  is  presented  in  Table  4.1. 
Unless  otherwise  specified,  the  matrices  are  of  size  256  x  256  and  the  block  size  is 
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b  =  8.  The  binary  tree  version  of  CARRQR  uses  a  tournament  that  has  2b  columns 
at  the  leaves  of  the  reduction  tree.  These  matrices  were  used  in  several  previous  pa¬ 
pers  focusing  on  rank  revealing  factorizations,  as  [4,  20,  26,  23].  A  short  description 
is  given  in  the  table.  In  addition,  we  describe  in  more  detail  here  matrices  Break-1, 
Break- 9,  H-C,  and  Stewart,  which  are  random  matrices  of  size  n  x  n  with  pre¬ 
scribed  singular  values  {cr.;}.  The  first  three  matrices  are  of  the  form  A  =  UT,VT , 
where  U,  V  are  random  orthogonal  matrices  and  E  is  a  diagonal  matrix.  For  Break- 
1,  E  has  the  following  diagonal  entries  a\  =  ...  =  on-\  =  l,cr„  =  10~9  [4].  For 
Break-9,  the  diagonal  entries  of  E  are  or  =  . . .  =  cr„_ 9  =  1,  crn-g  =  . . .  =  an  =  10-9 
[4].  For  H-C,  E  has  diagonal  entries  100,  10,  and  the  following  n  —  2  are  evenly 
spaced  between  10-2  and  10-8  [23].  For  Stewart,  A  =  UY,VT  +  O.la^oE  [26], 
where  E  is  a  diagonal  matrix  with  the  first  50  diagonals  decreasing  geometrically 
from  1  to  10-3  and  the  last  n  —  50  diagonals  being  set  to  zero,  U  and  V  are  random 
orthogonal  matrices,  FI  is  a  matrix  with  random  entries  chosen  from  a  uniform  distri¬ 
bution  in  the  interval  (0, 1).  In  Matlab  notation,  d=linspace(l,le-3,n);  d(51:end)=0; 
A=orth(rand(n))*diag(d)*orth(rand(n))-|-0.1*d(50)*rand(n) 

We  first  discuss  the  Kahan  matrix  and  the  devil’s  stairs.  The  Kahan  matrix 
is  presented  in  equation  (4.4)  where  c2  +  s2  =  1.  For  c  =  0,  the  singular  values 
are  sl,i  =  0, ... ,  n  —  1.  An  increasing  gap  between  the  last  two  singular  values  is 
obtained  when  the  value  of  c  is  increased.  Since  crmin(F?n)  <  <7k{A)  in  (1.1),  we 
would  like  to  find  n  such  that  cr,nin  ( Ai  1 )  is  sufficiently  large.  For  the  Kahan  matrix, 
it  is  known  that  Ofc(A) / CTm;n(I?ii)  >  |c3(  1  +  c)n_4/s,  and  ermin(Aii)  can  be  much 
smaller  than  <Jk(A)  [20].  Traditional  QR  with  column  pivoting  does  not  permute  the 
columns  of  A  in  exact  arithmetic  and  fails  to  reveal  the  gap  between  the  last  two 
singular  values.  It  is  easy  to  notice  that  CARRQR  does  not  permute  the  columns  in 
exact  arithmetic  either,  if  we  assume  that  during  tournament  pivoting,  ties  are  broken 
by  choosing  the  leftmost  column  when  multiple  columns  have  the  same  norm.  This 
results  in  poor  rank  revealing,  similarly  to  QRCP.  However  in  finite  precision,  both 
QRCP  and  CARRQR  might  permute  the  columns  of  the  matrix  and  in  this  case  both 
reveal  the  rank,  as  the  results  displayed  in  Table  4.2  show,  where  we  compare  the  last 
two  singular  values  with  the  last  two  L-values  and  R-values  obtained  from  QRCP, 
CARRQR-B,  and  CARRQR-F. 

To  avoid  a  possible  non-trivial  permutation  in  finite  precision  (a  well-known  phe¬ 
nomenon  [12]  of  the  Kahan  matrix),  we  multiply  the  j-th  column  of  the  Kahan  matrix 
by  (1  —  t)-?-1  for  all  j  and  a  small  r  e.  The  additional  numerical  experiments  we 
have  performed  revealed  that  when  r  is  small,  all  three  algorithms,  QRCP,  CARRQR- 
B,  and  CARRQR-F,  pivot  the  columns  of  A  and  are  effective  in  revealing  the  rank. 
However,  when  the  parameter  r  is  increased,  for  example  for  r  =  10  ,  the  three 

algorithms  do  not  perform  any  pivoting,  and  result  in  poor  rank  revealing. 


Ht  is  not  exactly  the  matrix  used  in  Figure  2.1  in  [26]. 


24 


No. 

Matrix 

Descriptions 

1 

Baart 

Discretization  of  the  1st  kind  Fredholm  integral  equation  [21]. 

2 

Break-1 

Break  1  distribution,  matrix  with  prescribed  singular  values, 
see  description  in  the  text  and  [4]. 

3 

Break- 9 

Break  9  distribution,  matrix  with  prescribed  singular  values, 
see  description  in  the  text  and  [4]. 

4 

Deriv2 

Computation  of  second  derivative  [21]. 

5 

Exponential 

Exponential  Distribution,  a\  =  1,  cq  =  <P_1  ( i  =  2,  ...,n), 
a  =  10-1/n  [4], 

6 

Foxgood 

Severely  ill-posed  test  problem  [21]. 

7 

Gks 

An  upper-triangular  matrix  whose  j-tli  diagonal  element  is 
l/v(?  and  whose  (■ i,j )  element  is  —  1/y/j,  for  j  >  i  [17,  20]. 

8 

Gravity 

ID  gravity  surveying  problem  [21]. 

9 

H-C 

Matrix  with  prescribed  singular  values,  see  description  in  the 
text  and  [23]. 

10 

Heat 

Inverse  heat  equation  [21]. 

11 

Phillips 

Phillips’  famous  test  problem  [21]. 

12 

Random 

Random  matrix  A  =  2  *  rand(n)  —  1  [20]. 

13 

Scale 

Scaled  random  matrix,  a  random  matrix  whose  i-tli  row  is 
scaled  by  the  factor  rf!n  [20].  We  choose  ?y  =  10  •  e. 

14 

Shaw 

ID  image  restoration  model  [21]. 

15 

Spikes 

Test  problem  with  a  ’’spiky”  solution  [21]. 

16 

Stewart 

Matrix  A  =  USVT  +  0.1cr5O  *  rand(n),  see  description  in  the 
text  and  [26]. 

17 

Ursell 

Integral  equation  with  no  square  integrable  solution  [21]. 

18 

Wing 

Test  problem  with  a  discontinuous  solution  [21]. 

19 

Kahan 

Kahan  matrix,  see  equation  (4.4). 

20 

Devil 

The  devil’s  stairs,  a  matrix  with  gaps  in  its  singular  values,  see 
Algorithm  2  in  Appendix  A  [26]. 

Table  4.1 
Test  matrices. 


The  devil’s  stairs  [26]  is  a  matrix  with  multiple  gaps  in  its  singular  values,  gener¬ 
ated  by  the  Matlab  code  given  in  Algorithm  2  in  Appendix  A.  From  Figure  4.1,  we 
can  see  that  the  L- values  reveal  the  gaps  in  the  singular  values,  while  the  R-values  do 
not.  Both  our  versions  of  CARRQR  factorizations  provide  a  good  first  step  for  the 
QLP  factorization,  similar  to  the  traditional  QR  with  column  pivoting. 

We  now  discuss  the  other  matrices  in  our  test  set,  given  in  alphabetical  order  in 
Table  4.1.  The  results  for  the  first  matrix,  Baart,  are  given  in  Figure  4.2.  Figure 
4.2(a)  shows  the  R-values  of  QR  with  column  pivoting  (QRCP),  CARRQR-B,  and 
CARRQR-F,  and  the  singular  values  of  this  matrix,  while  Figure  4.2(b)  displays 
the  ratios  of  R-values  to  singular  values.  In  Figure  4.2(a),  the  curves  for  R-values 
and  singular  values  are  almost  superimposed,  and  they  are  well  above  the  bounds  of 
trustworthiness,  except  for  the  last  few  elements.  Hence  the  R-values  reveal  the  rank 
for  this  matrix.  The  ratios  R(i,i)  / &i{R)  shown  in  Figure  4.2(b)  are  close  to  1,  and 
well  bounded  by  the  formulas  in  equation  (4.3).  For  all  the  other  matrices  in  our  test 
set,  similar  plots  with  similar  behaviour  are  given  in  Appendix  A,  except  that  the 
R-values  of  a  few  matrices  as  Foxgood  lie  under  the  trustworthiness  bounds  given 
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(a)  R-values  of  QRCP  and  CARRQR 


(b)  Ratios 


(c)  L-values  of  QRCP  and  CARRQR 


Fig.  4.1.  The  devil’s  stairs ,  matrix  of  size  128  x  128.  The  roughly  horizontal  dotted  lines  in 

(a)  stand  for  the  upper  and  lower  bounds  given  in  (4.1)  and  (4.2);  the  horizontal  dotted  lines  in 

(b)  denote  the  upper  and  lower  bounds  from  (4.3),  where  the  Y  factors  are  obtained  from  QRCP, 
CARRQR-B,  and  CARRQR-F  respectively. 
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singular  values  QRCP  CARRQR-B  CARRQR-F 


(Tn—l 

an 

rn—i/an—i 

rn/an 

r  n— 1/ an—\ 

"f'nj  an 

r n—i/an—i 

r„/c r„ 

0.1 

5.57E-01 

5.71E-06 

1.0488 

2.4004 

1.0488 

2.4004 

1.0488 

2.4004 

0.2 

8.37E-02 

1.26E-11 

1.0954 

7.7787 

1.0954 

7.7787 

1.0954 

7.7787 

0.3 

3.00E-03 

1.59E-17 

1.1402 

1.5650 

1.1402 

1.5650 

1.1402 

1.5650 

0.4 

2.01E-05 

7.96E-24 

1.1832 

2.8006 

1.1832 

1.4289 

1.1832 

1.4350 

0.5 

1.65E-08 

9.26E-31 

1.2247 

2.0125 

1.2247 

2.0014 

1.2247 

2.0021 

0.6 

7.79E-13 

1.06E-38 

1.2649 

1.2810 

1.2649 

1.2894 

1.2649 

1.2811 

singular  values 

QRCP 

CARRQR-B 

CARRQR-F 

an— 1 

an 

In— l/ an— 1 

In/ an 

ln—l/an—1 

In/ an 

In— 1/  an—i 

In/  an 

0.1 

5.57E-01 

5.71E-06 

1.0223 

1.0000 

1.0223 

1.0000 

1.0223 

1.0000 

0.2 

8.37E-02 

1.26E-11 

1.0392 

1.0000 

1.0392 

1.0000 

1.0392 

1.0000 

0.3 

3.00E-03 

1.59E-17 

1.0512 

1.0000 

1.0512 

1.0000 

1.0512 

1.0000 

0.4 

2.01E-05 

7.96E-24 

1.0583 

1.0000 

1.0583 

1.0000 

1.0583 

1.0043 

0.5 

1.65E-08 

9.26E-31 

1.0607 

1.0000 

1.0607 

0.9945 

1.0607 

0.9948 

0.6 

7.79E-13 

1.06E-38 

1.0583 

1.0000 

1.0583 

1.0065 

1.0583 

1.0000 

Table  4.2 

Kahan  matrix  (of  size  n  =  128J,  comparison  between  last  two  singular  values,  R-values,  and 
L-values.  In  the  table,  rn—i,rn  are  the  last  two  R-values,  and  ln—i,ln  are  the  last  two  L-values. 


in  equations  (4.1)  and  (4.2). 

Figures  4.3,  4.4,  and  4.5  summarize  these  results.  Figure  4.3  presents  the  ratios  of 
the  R-values  obtained  from  the  different  QR  factorizations  to  the  singular  values.  Let 
r  be  the  vector  of  R-values  obtained  from  QRCP/CARRQR,  and  let  s  be  the  vector 
of  singular  values.  The  R-values  obtained  from  CARRQR  are  not  in  descending  order 
and  we  do  not  sort  them.  The  values  displayed  in  Figure  4.3  are  the  minimum  of  the 
ratios  min(r./s),  the  maximum  of  the  ratios  max(r./ s),  and  the  median  of  the  ratios 
median{r./ s).  For  all  the  test  cases,  we  can  see  that  the  ratios  are  close  to  1,  and 
are  well  bounded  as  in  equation  (4.3).  We  conclude  that  the  R-values  obtained  by 
communication  avoiding  RRQR  reveal  the  rank  for  all  the  matrices  in  our  test  set. 


(a)  R-values 


(b)  Ratios 


Fig.  4.2.  R-values  and  ratios  of  QR  with  column  pivoting  ( QRCP),  binary-tree-based  CARRQR 
(CARRQR-B),  and  flat-tree-based  CARRQR  (CARRQR-F)  of  Baart  matrix.  The  dotted  lines  in 
(a)  display  the  upper  and  lower  bounds  given  in  (4.1)  and  (4.2);  the  dotted  lines  in  (b)  represent  the 
upper  and  lower  bounds  in  (4.3),  where  the  Y  factors  are  obtained  from  QRCP,  CARRQR-B,  and 
CARRQR-F  respectively. 

In  the  plots  displaying  R-values  and  singular  values  in  Figure  4.2  and  in  Appendix 
A,  it  might  be  hard  to  distinguish  if  the  computed  R-values  are  above  or  below  the 
bounds  of  trustworthiness  from  equations  (4.1)  and  (4.2).  For  the  purpose  of  clarity, 
we  display  in  Figure  4.4  the  ratio  min.;{  e| | (Tn)'('— 1 1 I  f°r  the  first  18  matrices  in  our 
test  set.  The  R  factor  is  obtained  from  QRCP,  CARRQR-B,  and  CARRQR-F.  It  can 
be  seen  that  for  the  matrices  numbered  1,  5  to  8,  10,  14,  15,  and  18,  this  ratio  is 
smaller  than  1.  But  as  displayed  in  Figure  4.3,  the  ratios  R{i,i) / <Ji(R)  are  close  to  1, 
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and  well  bounded  by  equation  (4.3).  In  other  words,  even  when  e||( xnjr i)||  small 
and  we  do  not  expect  an  accurate  computation  of  the  R-values  due  to  round-off  error, 
the  computed  R-values  are  still  very  close  to  the  accurately  computed  singular  values 
of  the  R  matrix. 


i  i  ji  -i-  JL 

r  > 

if  j  k  J 

IT’- 

JI  J  1  L fl  p 

median 

■  -  ■  -  ■  lower  bound  | 
-  -  -  upper  bound  : 

-  ^  |  1 

in 

D  TJ 

TJ:  1  L  J 

r  1  Tl'-pt  l 

n'v-' 

¥')  -n-' 

i  i 

i 

JI 

n 

i  - 

■  

median 

-  ■  -  ■  lower  bound  ; 
-  -  -  upper  bound  : 

1 

m 

in 

n  □  t 

- ! 

□ 

□ 

□ 

□ 

rr 

1  2  3 

4 

5 

6  7 

8 

9 

10 

11  12 

13 

5 

16 

17  18 

ja  JI  n  - 

n  v- 

■  n  -n- 

1  n 

«  ' 

JI 

P 

i  - 

n  

median 

-  ■  -  ■  lower  bound  ; 
-  -  -  upper  bound  : 

1 

in 

in 

inn 

^i 

□ 

□ 

□ 

□ 

rr 

0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18 


Fig.  4.3.  Minimum,  maximum,  and  median  of  the  ratios  R(i,i)/cri(R),  where  i  =  1, . . .  ,n  and 
n  is  the  size  of  the  input  matrix.  The  dotted  horizontal  lines  represent  the  upper  and  lower  bounds 
from  equation  (4.3).  The  R  factor  is  obtained  from  QRCP  (top  plot),  CARRQR-B  (middle  plot), 
and  CARRQR-F  (bottom  plot). 


Fig.  4.4.  Ratios  min^— i:n{  e 1 1 (^ri) 0^) [  |  }  /or  firs^  18  matrices  in  our  test  set,  where  n  is  the 
size  of  the  input  matrix.  The  R  factor  is  obtained  from  QRCP  (left  column),  CARRQR-B  (middle 
column),  and  CARRQR-F  (right  column). 


As  explained  earlier,  tournament  pivoting  used  in  CARRQR  does  not  guarantee 
that  the  R-values  decrease  monotonically,  while  QR  with  column  pivoting  does.  The 
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formula  (4.3)  should  be  modified  for  CARRQR  as 


1 

M 


<  —  <  ||F_1| 

O  i 


where  pi  is  the  i-th  largest  diagonal  of  the  R-factor  of  CARRQR,  that  is,  the  Tth 
value  of  the  sorted  R-values.  Since  the  CARRQR  factorizations  have  good  rank- 
revealing  properties  in  practice,  we  can  expect  that  (4.3)  still  holds  approximately 
for  CARRQR,  even  without  this  modification.  And  this  is  indeed  verified  by  the 
numerical  results  in  Figure  4,  where  the  ratios  R(i,i)/ai  are  bounded  by  the  bounds 
in  (4.3)  for  all  our  test  cases.  To  check  the  monotonicity  of  the  R-values,  Figure  4.5 
displays  the  minimum,  the  maximum,  and  the  median  of  the  rations  | R(i  +  l,i  + 
l)|/|f?(*,i)|,  where  i  =  1,. . .  ,n  —  1  and  n  is  the  size  of  the  input  matrix.  The  factor 
R  is  obtained  from  RRQR  with  column  pivoting  (top  plot),  CARRQR-B  (middle 
plot),  and  CARRQR-F  (bottom  plot).  It  can  be  seen  that  for  both  CARRQR-B  and 
CARRQR-F,  the  ratios  are  well  below  1  except  for  a  very  few  cases. 
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Fig.  4.5.  Ratios  of  succesive  R-values,  |R(i  +  1,  i  +  l)|/|R(i,  i)|(l  for  i  =  1, . . . ,  n  —  1  and  n  is 
the  size  of  the  matrix.  The  R  factor  is  obtained  from  QRCP  (top  plot),  CARRQR-B  (middle  plot), 
and  CARRQR-F  (bottom  plot). 


5.  Tournament  pivoting  for  other  factorizations.  In  this  section  we  de¬ 
scribe  very  briefly  how  the  tournament  pivoting  can  be  extended  to  other  factoriza¬ 
tions  that  require  some  form  of  pivoting. 

Cholesky  with  diagonal  pivoting.  Since  QR  with  column  pivoting  applied  to  A 
is  mathematically  equivalent  to  Cholesky  with  diagonal  pivoting  applied  to  AT A  (in 
exact  arithmetic),  the  same  general  techniques  as  for  RRQR  can  be  used  in  this 
context.  But  now  the  leaves  in  our  reduction  tree  are  b- by-b  diagonal  subblocks  of 
the  symmetric  positive-definite  matrix  H  to  which  we  want  to  apply  Cholesky  with 
pivoting,  and  at  each  internal  node  of  the  (binary)  reduction  tree  we  take  the  26-by- 
2 b  diagonal  submatrix  enclosing  the  two  b-by-b  matrices  from  the  child  nodes,  and 
perform  some  kind  of  Cholesky  with  diagonal  pivoting  on  this  2b-by-2b  matrix  to 
select  the  best  b-by-b  submatrix  to  pass  up  the  tree. 

Rank-revealing  decompositions  of  products/quotients  of  matrices.  The  goal  is  to 
compute  a  rank-revealing  QR-like  decomposition  of  an  arbitrary  product  P  =  Af1  ■ 
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A^1  ■  ■  ■  A^1  without  actually  multiplying  or  inverting  any  of  the  factors  A This 
problem  arises,  for  example,  in  algorithms  for  eigenproblems  and  the  SVD  that  either 
run  asymptotically  as  fast  as  fast  matrix  multiplication  (0(nu)  for  some  u>  <  3),  or 
minimize  communication. 

When  P  =  Ai,  the  communication  avoiding  RRQR  algorithm  discussed  in  this 
paper  can  be  used.  But  when  P  =  A ,  or  k  >  1  matrices  are  involved,  the  only 
solution  available  so  far  involves  randomization,  i.e.  computing  the  regular  QR  de¬ 
composition  of  PV  where  V  is  a  random  orthogonal/ unit  ary  matrix.  This  can  be 
obtained  by  computing  only  QR  factorizations  and  multiplying  by  orthogonal  ma¬ 
trices,  and  without  multiplying  or  inverting  other  matrices,  and  so  it  is  stable.  The 
resulting  factorization  is  of  the  form  P  =  QRV  where  V  is  random  orthogonal,  Q  is 
orthogonal,  and  R  is  represented  as  a  product  of  R  factors  and  their  inverses.  The 
useful  property  is  that  the  leading  columns  of  Q  do  indeed  span  the  desired  subspaces 
with  high  probability.  This  approach  is  used  for  designing  communication-avoiding 
eigendecompositions  and  singular  value  decompositions  [2]. 

GECP  -  LU  with  complete  pivoting.  An  approach  to  extend  tournament  pivoting 
to  GECP  is  to  perform  each  panel  factorization  as  following.  Tournament  pivoting 
uses  RRQR  to  pick  the  next  best  b  columns  to  use,  and  then  uses  TSLU  on  these 
columns  to  pick  their  best  b  rows.  Then  the  resulting  6-by-6  submatrix  is  permuted  to 
the  upper  left  corner  of  the  matrix,  and  b  steps  of  LU  with  no  pivoting  are  performed. 
Then  the  process  repeats  on  the  trailing  submatrix.  The  intuition  is  that  by  picking 
the  best  b  columns,  and  then  the  best  b  rows  from  these  columns,  we  are  in  effect 
picking  the  best  b-by-b  submatrix  overall,  which  is  then  permuted  to  the  upper  left 
corner.  The  communication  costs  are  clearly  minimal,  since  it  just  uses  previously 
studied  components. 

LDLt  factorization  with  pivoting.  The  challenge  here  is  preserving  symmetry.  A 
possible  usage  of  tournament  pivoting  is  the  following: 

1.  Use  the  approach  for  GECP  above  to  find  the  ’’best”  b-by-b  submatrix  S  of 
the  symmetric  matrix  A ,  whether  or  not  it  is  a  principal  submatrix  (i.e.  lies 
in  the  same  set  of  b  rows  and  b  columns  of  A).  If  S'  is  a  principal  submatrix, 
symmetrically  permute  it  to  lie  in  the  first  b  rows  and  columns  of  A ,  and  use 
it  to  perform  b  steps  of  symmetric  LU  without  pivoting;  otherwise  continue 
to  step  2. 

2.  Expand  S  to  be  a  smallest-possible  principal  submatrix  of  A ,  called  T.  The 
dimension  of  T  can  be  from  b-\- 1  up  to  26,  depending  on  how  much  S  overlaps 
the  diagonal  of  A. 

3.  Find  a  well-conditioned  principal  submatrix  W  of  T  of  dimension  d  where 
b  <  d  <  26,  permute  it  into  the  top  left  corner  of  A,  and  perform  d  steps  of 
symmetric  LU  as  above. 

The  conjecture  is  that  such  a  well-conditioned  principal  submatrix  W  of  T  must 
exist.  Here  is  a  sketch  of  an  algorithm  to  find  W .  The  6  largest  singular  values  of  T 
are  at  least  as  large  as  the  6  singular  values  of  S,  by  the  interlacing  property.  Write 
the  eigendecomposition  T  =  QAQT  =  [Qi,  Q2]diag(Ai,  A2)[Qi,  Q2Y  where  Ai  has 
the  large  eigenvalues,  and  A2  has  the  small  ones.  Do  GEPP  on  Q\  to  identify  the 
most  independent  rows,  and  choose  these  as  pivot  rows  determining  W .  The  above 
sketch  may  serve  to  prove  that  a  well-conditioned  principal  submatrix  W  exist,  under 
the  condition  that  there  is  a  big  enough  gap  in  the  eigenvalues  of  T.  But  we  may 
prefer  to  simply  do  more  conventional  LDLT  factorization  with  pivoting  to  find  it, 
instead  of  using  an  eigendecomposition  as  above. 
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Of  course  one  reason  for  using  LDLT  factorization  instead  of  LU,  is  that  it 
traditionally  takes  half  the  storage  and  half  the  flops.  It  is  not  clear  whether  we  can 
retain  either  of  these  advantages,  or  even  how  much  the  latter  one  still  matters. 

6.  Conclusions.  In  this  paper  we  introduce  CARRQR,  a  communication  opti¬ 
mal  rank  revealing  QR  factorization  based  on  tournament  pivoting.  CARRQR  does 
asymptotically  less  communication  than  the  classic  rank  revealing  QR  factorization 
based  on  column  pivoting,  at  the  cost  of  performing  a  factor  of  3  more  floating  point 
operations.  We  have  shown  through  extensive  numerical  experiments  on  challeng¬ 
ing  matrices  that  the  CARRQR  factorization  reveals  the  rank  similarly  to  the  QR 
factorization  with  column  pivoting,  and  provides  results  close  to  the  singular  values 
computed  with  the  highly  accurate  routine  DGESVJ.  Our  future  work  will  focus  on 
implementing  CARRQR  and  studying  its  performance  with  respect  to  current  im¬ 
plementations  of  QR  with  column  pivoting  available  in  LAPACK  and  ScaLAPACK. 
We  have  also  outlined  how  tournament  pivoting  extends  to  a  variety  of  other  pivoted 
matrix  factorizations,  as  Cholesky  with  diagonal  pivoting,  LU  with  complete  pivoting, 
or  LDLt  factorization  with  pivoting. 
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7.  Appendix  A.  We  present  additional  numerical  results  for  the  matrices  in 
our  test  set  comparing  the  diagonal  values  of  the  R  factor  obtained  after  QRCP  and 
CARRQR  with  the  singular  values  of  the  input  matrices. 


Algorithm  2  Matlab  code  for  generating  the  devil’s  stairs  matrix 
1:  Length  =  20; 

2:  s  =  zeros(n,l); 

3:  Nst  =  floor(n/Length); 

4:  for  i  =  1  :  Nst  do 

5:  s(l+Length*(i-l):Length*i)  =  -0.6*(i-l); 

6:  end  for 

7:  s(Length*Nst:end)  =  -0.6*(Nst-l); 

8:  s  =  10.  A  s; 

9:  A  =  orth(rand(n))  *  diag(s)  *  orth(randn(n)); 


8.  Appendix  B.  We  present  detailed  performance  counts  for  the  communica¬ 
tion  avoiding  RRQR  algorithms  introduced  in  section  3.  In  all  the  counts,  we  ignore 
lower  order  terms. 

8.1.  Performance  counts  of  parallel  QR  with  tournament  pivoting.  We 

consider  that  the  input  matrix  is  distributed  block  cyclically  over  a  grid  of  P  =  Pr  x  Pc 
processors.  The  block  size  is  b.  We  compute  the  additional  flops  introduced  in  the 
algorithm  due  to  tournament  pivoting.  Hence  the  total  cost  of  the  algorithm  is  the 
cost  of  performing  tournament  pivoting  plus  the  cost  of  performing  CAQR.  In  the 
analysis,  we  give  in  detail  the  counts  at  each  step  j  of  the  block  algorithm,  and 
m-b  =  m  —  jb,  rib  =  n  —  jb- 

We  consider  first  tournament  pivoting  with  bp  =  nb/Pc,  that  is  the  matrices  at 
the  leaves  of  the  reduction  tree  have  rib /Pc  columns  and  as  many  rows  as  the  trailing 
matrix.  At  the  upper  levels  of  the  reduction  tree,  the  matrices  have  2b  columns. 
Tournament  pivoting  performs  the  following  operations. 

•  Each  column  grid  of  Pr  processors  performs  TSQR  factorization  of  a  matrix 
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(a)  Break- 1  -  R- values 


(e)  Deriv2  -  R-values 


(g)  Exponent  -  R-values 


(i)  Foxgood  -  R-values 


(b)  Break- 1  -  ratios 


(d)  Break-9  -  ratios 


(f)  Deriv2  -  ratios 


(h)  Exponent  -  ratios 


(j)  Foxgood  -  ratios 


Fig.  7.1.  R-values  of  QR  with  column  pivoting  (QRCP),  binary-tree-based  CARRQR 
(CARRQR-B),  and  flat- tree-based  CARRQR  (CARRQR-F)  and  singular  values  for  matrices  in 
our  test  set. 
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(c)  Gravity  -  R-values 

(d)  Gravity  -  ratios 

(e)  HC  -  R- values  (f)  HC  -  ratios 


(i)  Phillips  -  R- values  (j)  Phillips  -  ratios 

Fig.  7.2.  R-values  of  QR  with  column  pivoting  (QRCP),  binary-tree-based  CARRQR 
(CARRQR-B),  and  flat- tree-based  CARRQR  (CARRQR-F)  and  singular  values  for  matrices  in 
our  test  set. 
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Fig.  7.3.  R-values  of  QR  with  column  pivoting  (QRCP),  binary-tree-based  CARRQR 
(CARRQR-B),  and  flat- tree- based  CARRQR  (CARRQR-F)  and  singular  values  for  matrices  in 
our  test  set. 
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(a)  Ursell  -  R- values 

10"  - j - ! - i - e 


(c)  Wing  -  R- values 


-R™cp(|'l|'°(|) 


(b)  Ursell  -  ratios 


(d)  Wing  -  ratios 


Fig.  7.4.  R-values  of  QR  with  column  pivoting  (QRCP),  binary-tree-based  CARRQR 
(CARRQR-B),  and  flat-tree-based  CARRQR  (CARRQR-F)  and  singular  values  for  matrices  in 
our  test  set. 
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•  Each  processor  performs  locally  QR  with  column  pivoting  of  a  matrix  of  size 
rib/Pc  x  rib/Pc  to  select  b  local  pivot  columns. 
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•  Reduce  operation:  for  each  level  of  the  reduction  tree  of  depth  log2Pc  perform: 
—  Processors  in  the  same  row  grid  exchange  their  pivot  columns 
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—  Processors  in  the  same  column  grid  perform  TSQR  factorization  of  ma- 
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trices  of  size  mb  x  2b. 


%/b 


II  si  ST' /2mh(26)2  2(2 6)3  \  4(2mnb  -  n2b)  16 nb2 

# flops  =  Y  (  - „ - 1 - Y  log2  Pr  )  log2  Pc  =  - ^ - -  log2  Pc  H - 5—  log2  P,  log2  Pc 


i= 1 

n/b 


V  Pr 

4  b2 


Pr 


#words  =  Y/  log2  (Pr)  log2  (Pc)  =  2nb log2  (Pr)  log2  (Pc) 


3= 1 

Tl 

# messages  =  -  log2  (Pr)  log2  (Pc) 


Processors  perform  QR  with  column  pivoting  factorization  of  the  matrix 
of  size  2b  x  2b  of  the  R  factor  obtained  from  TSQR  at  each  node  of  the 
reduction  tree  in  the  current  level. 


n/b  4(2 b)3 

# flops  =  Y  — loS2  pc  ~  11  nb2  log2  Pc 
3= 1 


The  b  pivot  columns  are  permuted  in  the  diagonal  positions, 


n/b 


v-^  mbb  mn  —  n2/ 2 
Awards  =  ^  —  =  - - - 


•  1  pr 

3= 1 

Tl 

//messages  =  —  log2  Pc 


Hence  the  additional  cost  introduced  by  tournament  pivoting  is: 


,,  ,,  2mn2  —  n3/4  n  1  n3  n  „  „ 

#  flops  =  3 - p - tjr  +  6  -p?Wf°^P'  +  2) 
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H - p - log2  Pc  H - log2  Pr  log2  Pc  +  11  nb2  log2  Pc 


1  n2  n  ,  ,„N  .  mn  —  n2/ 2 


— log2(Pr)  + 


Pr 


(log2  Pc  +  1)  +  2n6 log2  (Pr)  log2  (Pc 


# messages  =  -  (log2  (Pr)  +  2  log2  (Pc)  +  log2  (Pr)  log2  (Pc)) 


We  analyze  now  the  algorithm  that  uses  tournament  pivoting  on  column  blocks 
of  size  2b  at  each  step  of  the  factorization.  At  each  step  of  the  block  algorithm, 
tournament  pivoting  performs  the  following  operations: 

•  At  the  leaves  of  the  reduction  tree,  the  processors  in  each  column  grid  perform 
TSQR  factorization  of  matrices  of  size  mb  x  26,  followed  by  the  QR  with 
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column  pivoting  of  the  R  factor  of  size  2b  x  2b,  to  identify  b  pivot  columns. 
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log2  (Pr)  +  y 
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=  loS2  (pr) 
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#' messages  =  ^  yy  log2  (Pr)  =  yy  log2  (Pr) 
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•  For  each  level  of  the  reduction  tree  of  depth  log2  (fj)  do 

—  Processors  in  the  same  row  grid  exchange  their  portions  of  the  previously 
selected  b  pivot  columns.  This  communication  is  performed  only  in  the 
last  log2  (Pc)  levels  of  the  reduction  tree,  since  in  the  previous  levels  the 
columns  reside  on  a  same  processor. 


n/b 

#words  <  ^2  7T61og2  (pc) 
i= i  t r 
Tl 

# messages  <  —  log2  (Pc) 


mn  —  n2/ 2 
Pr 


log2  (Pc) 


—  Processors  in  the  same  column  grid  perform  TSQR  factorization  of  ma¬ 
trices  of  size  mb  x  2b,  followed  by  the  QR  with  column  pivoting  of  the 
R  factor  of  size  2b  x  2b,  to  identify  b  pivot  columns.  The  cost  at  each 
level  is  the  same  as  the  cost  at  the  leaves  of  the  reduction  tree. 
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The  b  pivot  columns  are  permuted  in  the  diagonal  positions. 
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Hence  the  total  cost  of  tournament  pivoting  is: 
4  mn2  —  4?r3/3  8  n2b 
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P  3  Pc 
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Pr 


log2  (pc)  +  y  nb2  log2  (Pr)  log2  ( Pc )  +  —nb2  log2  (Pc) 
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Pr 


(log2  (Pc)  +  1) 


fl 

# messages  =  log2  (Pr)  +  -  log2  (Pc)(log2  (Pr)  +  1) 

Performance  with  optimal  layout.  We  analyze  the  two  versions  of  tournament 
pivoting  when  using  an  optimal  layout.  As  in  [10],  we  use  the  following  values: 


Pr  = 


mP 


,  Pc  = 


nP  ,  mn 

and  b  =  ti  ■  *  - , 

m  V  P 


(8.1) 


These  values  ensure  that  CAQR  is  communication  optimal  without  increasing  the 
flops  count.  With  these  values,  for  the  tournament  pivoting  with  bx  =  26,  we  obtain: 


4ffl°ps~ 


Amn2  —  4n3/3 
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P 


P 

8 B  [  ^  log2  (Pr)  +  log2  (Pc)  +  y  )  +  32 B2  (  ^  log2  (Pr)  log2  (Pc)  +  ^  log2  (Pc) 


By  choosing  B  =  8_1  log-1  (Pr)  log-1  (Pc),  the  first  term  in  the  flops  dominates 
the  redundant  computation  due  to  tournament  pivoting.  This  is  a  term  we  cannot 
decrease.  The  counts  become: 

4 mn2  -  4?r3/3  2 

flops  «  - — - 1-  cmn  ,  c  <  1 
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We  choose  the  same  optimal  layout  for  tournament  pivoting  with  =  n/Pc.  We 
obtain: 


#flops= 


5mn2  —  n3 


mn 

P 


6PB 


8 B  log2  (Pc)  +  32P2  -  log2  (Pr)  log2  (Pc)  +  -  log2  (Pc) 


The  communication  becomes: 
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8.2.  Performance  counts  of  sequential  QR  with  tournament  pivoting. 

In  this  section  we  describe  a  performance  model  for  sequential  QR  with  tournament 
pivoting,  which  is  analogous  to  the  current  Algorithm  1  (Parallel  QR  with  tournament 
pivoting).  Consider  that  the  matrix  A  to  be  factored  is  of  size  in  x  n,  and  that  the 
sequential  algorithm  factors  the  input  matrix  by  traversing  panels  of  6  columns.  First 
we  count  flops ,  ffwords  and  ff messages  ignoring  the  cost  of  permuting  columns, 
then  the  cost  of  permuting  columns  is  added. 

1.  # flops:  the  total,  independent  of  the  shape  of  the  tree  (as  long  as  we  only 
combine  two  b-column  panels  at  a  time)  cost  of  the  tournament  pivoting 
strategy  is  4mn2  —  (4/3)n3,  so  twice  the  rest  of  QR.  Quick  derivation:  A 
single  QR  factorization  of  an  r  x  2b  submatrix  costs  2r(26)2  —  (2/3)(26)3  = 

8 rb2  —  (16/3)63.  The  subsequent  RRQR  on  the  2b  x  2b  R  matrix  costs  another 
0(6 3),  so  one  step  of  the  tournament  costs  8 rb2  +  0(b3)  flops.  Whether  we 
use  a  flat,  binary  or  other  kind  of  tree,  if  we  started  with  c  columns,  and  so 
(c/6)  panels  of  b  columns  each,  the  cost  of  the  tournament  is  (c/6  —  1)  times 
as  much,  or  8rc6  +  0(c62).  Going  back  to  the  original  mxn  matrix,  there  are 
n/b  —  1  tournaments  with  a  total  cost  of 

^  flops  =  ^  [8(to— ib)(n— ib)b+0((n— ib)b2)]  =  4mn2  — (4/3)n3+0(mn6). 

i—0:n/b—  1 

2.  ffwords :  If  we  choose  6  =  ©(M1/2)  where  M  is  cache  size,  so  that  we  can 
do  an  r  x  26  QR  decomposition  with  a  flat  tree  and  read  the  panel  just  once, 
the  Swords  is  2rb  for  a  single  r  x  26  QR  factorization,  2 rc  for  a  tournament 
on  an  r  x  c  matrix,  and  altogether 

n/b—  1 

Jfwords  =  E  2(m—ib)(n—ib)  =  mn2 /6— (l/3)n3 /b+0(mn)  =  0(mn2  /M1^2) 

i= 0 

which  is  the  desired  lower  bound. 

3.  ff  messages :  If  we  use  a  blocked  data  structure  where  each  6x6  block  costs 
1  message  to  access,  then  one  r  x  26  QR  decomposition  with  a  flat  tree  costs 
(2r6/62)  =  2r/b  messages,  so  we  just  divide  #words  by  62  =  0(M)  to  get 
# messages  =  mn2 /b3  —  (l/3)n3/63  +  0(mn/b2)  =  &(mn2 /M3/2)  which  is 
the  desired  lower  bound. 

Now  we  add  the  cost  of  permuting  columns.  This  occurs  both  during  the  tourna¬ 
ment,  to  build  the  r  x  26  matrix  analyzed  at  each  step,  and  after  the  6  columns  are 
selected  by  the  tournament  and  must  be  moved  to  the  front  of  the  matrix.  One  can 
imagine  interleaving  these  two  operations,  or  doing  them  separately.  For  simplicity,  I 
will  describe  the  costs  of  doing  them  separately. 

The  basic  permutation  operation  during  the  tournament  is  to  take  the  list  of  6 
selected  columns  of  the  26  analyzed  at  a  tournament  stage,  and  build  a  new  packed 
r  x  6  matrix  from  them.  We  assume  36  x  6  blocks  fit  in  fast  memory  at  the  same  time: 
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for  i  =  1  to  r/b 

read  i-th  b-by-b  blocks  of  two  input  r-by-b  matrices 
select  b  columns  from  these  2b  columns  of  length,  b 
pack  them  into  a  b-by-b  block 
write  block  to  slow  memory 
endf or 

This  costs  #words  =  3 rb  and  # messages  =  3 r/b.  An  entire  tournament  on  an 
r  x  c  matrix  costs  c/b  times  as  much,  independent  of  the  shape  of  the  reduction  tree, 
or  //words  =  3 rc  and  # messages  =  3rc/b2.  Finally,  the  cost  of  all  the  tournaments 
on  an  m  x  n  matrix  is 

n/b—1 

//words  =  E  3(m  —  ib)(n  —  ib)  =  (3/2  )mn2/b  —  (1/2  )n3 /b  +  0(mri)  =  0(rnn2/M1//2) 

i- 0 

# messages  =  (3/2  )mn2 /b3  —  (1/2  )n3/&3  +  0(mn/b2)  =  0 
as  desired. 

Now  suppose  we  have  completed  a  tournament  and  need  to  go  back  to  the  original 
matrix  and  permute  its  columns  correspondingly.  Note  that  we  must  permute  not  just 
within  the  trailing  (to  —  ib)  x  (n  —  ib)  submatrix,  but  also  the  trailing  to  x  (n  —  ib) 
submatrix,  i.e.  the  upper  rows  of  the  R  factor  must  also  be  permuted.  We  organize 
the  permutation  in  two  phases: 

1.  a  collection  of  transpositions,  each  one  swapping  a  desired  column  from  a 
later  panel  with  an  undesired  column  from  the  current  panel 

2.  any  subsequent  reordering  within  the  current  panel. 

So  given  an  mx  b  panel  at  the  left  of  an  mx  c  submatrix,  the  algorithm  is  as  follows: 

for  i  =  1  to  m/b 

read  i-th  b-by-b  block  of  first  m-by-b  panel  (at  left) 
for  j  =  2  to  c/b 

if  any  transpositions  involve  j-th  m/b  p 

endf or 
endf or 


mn2  \ 
M3/2  ) 
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